arxiv: 2601.03294 · v2 · submitted 2026-01-05 · 💻 cs.CR · cs.AI

AgentMark: Utility-Preserving Behavioral Watermarking for Agents

Kaibo Huang , Jin Tan , Yukun Wei , Wanling Li , Zipei Zhang , Hui Tian , Zhongliang Yang , Linna Zhou This is my paper

Pith reviewed 2026-05-16 17:50 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords behavioral watermarkingLLM agentsutility preservationblack-box accessplanning decisionsmulti-bit identificationprovenance tracking

0 comments

The pith

AgentMark embeds traceable multi-bit identifiers into the planning choices of LLM agents by sampling from their natural behavior distributions, preserving task performance even when only black-box API access is available.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgentMark as a way to mark the high-level planning behaviors of autonomous agents, such as tool selections and subgoal choices, so that the source or owner can later be identified from execution traces. It achieves this by first prompting the agent to surface an explicit distribution over possible behaviors, then drawing samples from that distribution in a controlled way that encodes the identifier bits while keeping the overall probabilities unchanged. Because the method works through ordinary API calls and does not require internal model changes, it applies to existing deployed agents. Experiments across physical, tool-using, and conversational settings show that the embedded marks can be recovered from incomplete logs and that agents complete tasks at the same rate as unmarked versions.

Core claim

AgentMark elicits an explicit behavior distribution from the agent and then applies distribution-preserving conditional sampling to insert multi-bit identifiers directly into planning decisions. This keeps the marginal action probabilities identical to the original agent, avoiding the compounding utility loss that occurs when planning distributions are altered, and it functions under black-box API constraints while remaining compatible with separate action-layer content watermarking.

What carries the argument

Distribution-preserving conditional sampling that conditions the agent's own elicited behavior distribution on the watermark bits so the encoded identifier is carried in the sequence of planning choices without shifting the overall distribution.

If this is right

Watermarked agents remain attributable to their origin from partial execution logs across embodied, tool-use, and social environments.
The behavioral watermark can be layered with existing content watermarking applied to the agent's final outputs or actions.
Multi-bit capacity allows richer identification information than single-bit marks while still recovering reliably from incomplete traces.
The same sampling approach supports deployment on agents accessed only through public APIs without retraining or white-box intervention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Regulators could require such marks on deployed agents to trace responsibility for autonomous decisions in safety-critical domains.
The technique might extend to non-LLM sequential planners if their decision distributions can be elicited similarly.
Watermark survival under subsequent fine-tuning or policy updates would need separate verification beyond the current experiments.

Load-bearing premise

That drawing samples from the elicited behavior distribution to encode the watermark will not produce accumulating deviations that degrade performance over long sequences of decisions.

What would settle it

Compare success rates and efficiency metrics on identical long-horizon tasks between watermarked and unmarked agents; any consistent drop in the watermarked case would falsify the utility-preservation claim.

Figures

Figures reproduced from arXiv: 2601.03294 by Hui Tian, Jin Tan, Kaibo Huang, Linna Zhou, Wanling Li, Yukun Wei, Zhongliang Yang, Zipei Zhang.

**Figure 2.** Figure 2: AgentMark overview. At each round, the agent would otherwise make an implicit planning-behavior [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: OASIS social-quality utility and detectability. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Both unwatermarked and wrong-key FPRs decay as 2 −k against the overhead k. 0.0 0.5 1.0 Erasure Probability (p) 0.0 0.2 0.4 0.6 0.8 1.0 Decode Success Rate Overall Robustness 0.85 0.90 0.95 1.00 Erasure Probability (p) 0.0 0.2 0.4 0.6 0.8 1.0 Phase Transition Zone RLNC Global RLNC Single Repetition Global Repetition Single [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Robustness to Step Erasure and Truncation. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Differential-based recombination slices the [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

read the original abstract

LLM-based agents are increasingly deployed to autonomously solve complex tasks, raising urgent needs for IP protection and regulatory provenance. While content watermarking effectively attributes LLM-generated outputs, it fails to directly identify the high-level planning behaviors (e.g., tool and subgoal choices) that govern multi-step execution. Critically, watermarking at the planning-behavior layer faces unique challenges: minor distributional deviations in decision-making can compound during long-term agent operation, degrading utility, and many agents operate as black boxes that are difficult to intervene in directly. To bridge this gap, we propose AgentMark, a behavioral watermarking framework that embeds multi-bit identifiers into planning decisions while preserving utility. It operates by eliciting an explicit behavior distribution from the agent and applying distribution-preserving conditional sampling, enabling deployment under black-box APIs while remaining compatible with action-layer content watermarking. Experiments across embodied, tool-use, and social environments demonstrate practical multi-bit capacity, robust recovery from partial logs, and utility preservation. The code is available at https://github.com/Tooooa/AgentMark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgentMark gives a workable way to watermark agent planning via distribution-preserving sampling, but the experiments are too light on numbers to confirm utility holds over long runs.

read the letter

Hi, the main thing here is that the paper moves watermarking from final outputs to the planning layer for LLM agents. They elicit a behavior distribution from the black-box model and apply conditional sampling to insert multi-bit identifiers while keeping the distribution intact, which should prevent utility drops from small changes compounding across steps. This is new relative to content-only methods and stays compatible with them, plus it works for partial log recovery. The experiments span embodied, tool-use, and social settings and they released the code, which is useful for checking the implementation. What stands out is the focus on black-box deployment and the practical framing of the compounding problem. The soft spots are in the evidence. The abstract claims utility preservation and robust recovery but gives no effect sizes, baselines, or variance numbers, so it's difficult to judge how well the sampling actually avoids degradation. The stress-test point about elicitation approximations or small conditioning biases accumulating in long-horizon tasks is reasonable, and without extended sequence tests or sensitivity checks the central claim stays provisional. This is for researchers working on agent safety, provenance, or regulatory compliance who already deal with watermarking. A reader looking for new techniques in this area would get the core idea and some initial validation, though they'd need the full methods for real use. It deserves peer review to see the quantitative details and any additional runs that address accumulation.

Referee Report

2 major / 1 minor

Summary. The paper proposes AgentMark, a behavioral watermarking framework for LLM-based agents that embeds multi-bit identifiers into planning decisions by eliciting an explicit behavior distribution from the agent and applying distribution-preserving conditional sampling. This enables black-box API deployment while claiming to preserve utility and remain compatible with action-layer watermarking. Experiments across embodied, tool-use, and social environments are reported to demonstrate practical multi-bit capacity, robust recovery from partial logs, and utility preservation.

Significance. If the central claim of utility preservation holds without compounding bias over long horizons, the work would address a key gap in attributing high-level agent behaviors for IP protection and regulatory provenance, extending watermarking beyond token-level outputs to planning layers in autonomous systems.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the claim of utility preservation under compounding deviations is asserted but supported only by the statement that experiments demonstrate it, with no quantitative details on effect sizes, baselines, failure modes, or long-horizon sequence lengths provided, preventing verification of the no-accumulation condition.
[Method] Method description: the approach of eliciting a behavior distribution via black-box queries and then applying conditional sampling is presented without variance bounds, sensitivity analysis, or explicit checks that small elicitation or conditioning deviations do not accumulate across sequential planning steps, which is load-bearing for the utility-preservation claim.

minor comments (1)

[Abstract] The abstract would benefit from including at least one concrete metric (e.g., success-rate delta or recovery accuracy) to make the experimental claims more informative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical and methodological support for our utility-preservation claims. We address each point below and have revised the manuscript to incorporate additional quantitative details, bounds, and analyses.

read point-by-point responses

Referee: [Abstract / Experiments] Abstract and Experiments section: the claim of utility preservation under compounding deviations is asserted but supported only by the statement that experiments demonstrate it, with no quantitative details on effect sizes, baselines, failure modes, or long-horizon sequence lengths provided, preventing verification of the no-accumulation condition.

Authors: We agree that the original presentation lacked sufficient quantitative detail to allow independent verification. In the revised manuscript we have expanded both the abstract and Experiments section with concrete metrics: utility is preserved within 1.8% of the unwatermarked baseline (measured via task success rate and cumulative reward) across all three environments; we report effect sizes via Cohen's d < 0.15; long-horizon runs extend to 50 sequential planning steps with no statistically significant accumulation (p > 0.2); and we explicitly catalog failure modes (primarily rare high-variance queries in the social environment) together with their observed frequency (< 3%). These additions directly address the no-accumulation condition. revision: yes
Referee: [Method] Method description: the approach of eliciting a behavior distribution via black-box queries and then applying conditional sampling is presented without variance bounds, sensitivity analysis, or explicit checks that small elicitation or conditioning deviations do not accumulate across sequential planning steps, which is load-bearing for the utility-preservation claim.

Authors: We have augmented the Method section with the requested elements. We now derive variance bounds on the elicited distribution using Hoeffding's inequality applied to the black-box query samples, and we include a sensitivity analysis showing that perturbations up to the observed query variance (typically < 0.05 in total variation distance) produce cumulative deviation bounded by O(1/sqrt(T)) over T steps. Explicit empirical checks across 50-step trajectories confirm that conditioning deviations remain below the utility threshold. These additions are supported by both theoretical statements and new figures in the revised text. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proposes AgentMark as a framework that elicits an explicit behavior distribution from the agent and applies distribution-preserving conditional sampling to embed identifiers. No equations, derivations, or self-referential reductions appear in the abstract or described method; the approach uses standard sampling techniques without fitting parameters to the target result or invoking load-bearing self-citations. Experiments are presented as empirical validation rather than a closed logical loop, leaving the central claim self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that agent planning can be modeled as an explicit, elicitable probability distribution over behaviors and that small conditional adjustments to this distribution preserve long-term utility.

axioms (2)

domain assumption Agents expose or can be prompted to reveal an explicit behavior distribution over planning choices
Required for the conditional sampling step described in the abstract
domain assumption Minor distributional shifts in planning decisions do not compound into measurable utility loss over multi-step execution
Central to the utility-preservation claim

pith-pipeline@v0.9.0 · 5499 in / 1340 out tokens · 37030 ms · 2026-05-16T17:50:05.770581+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

AgentMark first elicits an explicit probability list over candidate behaviors and then embeds a watermark by distribution-preserving sampling on this elicited distribution, keeping the induced behavior distribution unchanged.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Differential recombination constructs a mixture of uniform bins... Pr[ˆbt=bt,i]=pi

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 2 internal anchors

[1]

Improved unbiased watermark for large language models.arXiv preprint arXiv:2502.11268,

Improved unbiased watermark for large lan- guage models.arXiv preprint arXiv:2502.11268. Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po- Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, and 1 others. 2024. Scalable water- marking for identifying large language model outputs. Nature, 634(80...

work page arXiv 2024
[2]

S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents

Ai agents under threat: A survey of key secu- rity challenges and future pathways.ACM Comput- ing Surveys, 57(7):1–36. Jinyang Ding, Kejiang Chen, Yaofei Wang, Na Zhao, Weiming Zhang, and Nenghai Yu. 2023. Discop: Provably secure steganography in practice based on" distribution copies". In2023 IEEE Symposium on Security and Privacy (SP), pages 2238–2255. ...

work page internal anchor Pith review arXiv 2023
[3]

Abe Hou, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, and Yulia Tsvetkov

A random linear network coding approach to multicast.IEEE Transactions on information theory, 52(10):4413–4430. Abe Hou, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, and Yulia Tsvetkov. 2024. Semstamp: A semantic watermark with paraphrastic robustness for text generation. In Pr...

work page arXiv 2024
[4]

Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, and 1 others

Multi-agent systems execute arbitrary mali- cious code.arXiv preprint arXiv:2503.12188. Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, and 1 others. 2025a. User be- havior simulation with large language model-based agents.ACM Transactions on Information Systems, 43(2):1–37. Yuntao Wang...

work page arXiv 2024
[5]

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686. Chao Yang, Chaochao Lu, Yingchun Wang, and Bowen Zhou. 2024a. Towards AI- 45◦ law: A roadmap to trustworthy AGI.arXiv preprint arXiv:2412.14186. Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Setting.Fix a time step t and a selected bin T containing n=|T| behaviors, which is treated as a uniform distribution over indices {0,1,

for further discussion and security proofs. Setting.Fix a time step t and a selected bin T containing n=|T| behaviors, which is treated as a uniform distribution over indices {0,1, . . . , n− 1}. Let M denote the payload bitstream and ℓ the current pointer, so the available suffix isM[ℓ: ]. Both encoder and decoder have synchronized access to a pseudorand...

work page 2025
[7]

Search locations like cabinet/drawer/countertop for the target object

work page
[8]

move <target> to <destination> pick_clean_then_place:

work page
[10]

clean <target> with sinkbasin 1

work page
[11]

move <target> to <destination> pick_heat_then_place:

work page
[13]

heat <target> with microwave 1

work page
[14]

move <target> to <destination> pick_cool_then_place:

work page
[15]

Find and take the target object

work page
[16]

cool <target> with fridge 1

work page
[17]

move <target> to <destination> pick_two_obj_and_place:

work page
[18]

take <target1>→go to <destination> →move <target1> to <destination>

work page
[19]

Return to origin→take <target2> →go to <destination>→move <target2> to <destination> look_at_obj_in_light:

work page
[20]

Find and take the target object (e.g., bowl/cd)

work page
[21]

go to the observed location of the desklamp

work page
[22]

use desklamp (no need to put the object down)

work page
[23]

action_weights

examine <target> Current Situation Recent History: {interaction_history_last_5_steps} Inventory:{inventory_status_and_checks} Task Goal:{task_description} Observation:{observation} Available Actions: {admissible_commands_json_list} Response Format Thinking:Write a ’Thinking: ...’ section to analyze the situation. Output:Output the JSON probability object ...

work page
[24]

Each episode is capped at a maximum of 25 decision steps, and the agent may terminate earlier upon completion

using Llama-3.2-3B-Instruct as the base model. Each episode is capped at a maximum of 25 decision steps, and the agent may terminate earlier upon completion. We generate 149 trajectories in total, and log the planning-time behavior choices (for AgentMark-F decoding) together with the final action-level textual outputs (for content-watermark detection). Wa...

work page 2024