AgentMark: Utility-Preserving Behavioral Watermarking for Agents
Pith reviewed 2026-05-16 17:50 UTC · model grok-4.3
The pith
AgentMark embeds traceable multi-bit identifiers into the planning choices of LLM agents by sampling from their natural behavior distributions, preserving task performance even when only black-box API access is available.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentMark elicits an explicit behavior distribution from the agent and then applies distribution-preserving conditional sampling to insert multi-bit identifiers directly into planning decisions. This keeps the marginal action probabilities identical to the original agent, avoiding the compounding utility loss that occurs when planning distributions are altered, and it functions under black-box API constraints while remaining compatible with separate action-layer content watermarking.
What carries the argument
Distribution-preserving conditional sampling that conditions the agent's own elicited behavior distribution on the watermark bits so the encoded identifier is carried in the sequence of planning choices without shifting the overall distribution.
If this is right
- Watermarked agents remain attributable to their origin from partial execution logs across embodied, tool-use, and social environments.
- The behavioral watermark can be layered with existing content watermarking applied to the agent's final outputs or actions.
- Multi-bit capacity allows richer identification information than single-bit marks while still recovering reliably from incomplete traces.
- The same sampling approach supports deployment on agents accessed only through public APIs without retraining or white-box intervention.
Where Pith is reading between the lines
- Regulators could require such marks on deployed agents to trace responsibility for autonomous decisions in safety-critical domains.
- The technique might extend to non-LLM sequential planners if their decision distributions can be elicited similarly.
- Watermark survival under subsequent fine-tuning or policy updates would need separate verification beyond the current experiments.
Load-bearing premise
That drawing samples from the elicited behavior distribution to encode the watermark will not produce accumulating deviations that degrade performance over long sequences of decisions.
What would settle it
Compare success rates and efficiency metrics on identical long-horizon tasks between watermarked and unmarked agents; any consistent drop in the watermarked case would falsify the utility-preservation claim.
Figures
read the original abstract
LLM-based agents are increasingly deployed to autonomously solve complex tasks, raising urgent needs for IP protection and regulatory provenance. While content watermarking effectively attributes LLM-generated outputs, it fails to directly identify the high-level planning behaviors (e.g., tool and subgoal choices) that govern multi-step execution. Critically, watermarking at the planning-behavior layer faces unique challenges: minor distributional deviations in decision-making can compound during long-term agent operation, degrading utility, and many agents operate as black boxes that are difficult to intervene in directly. To bridge this gap, we propose AgentMark, a behavioral watermarking framework that embeds multi-bit identifiers into planning decisions while preserving utility. It operates by eliciting an explicit behavior distribution from the agent and applying distribution-preserving conditional sampling, enabling deployment under black-box APIs while remaining compatible with action-layer content watermarking. Experiments across embodied, tool-use, and social environments demonstrate practical multi-bit capacity, robust recovery from partial logs, and utility preservation. The code is available at https://github.com/Tooooa/AgentMark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AgentMark, a behavioral watermarking framework for LLM-based agents that embeds multi-bit identifiers into planning decisions by eliciting an explicit behavior distribution from the agent and applying distribution-preserving conditional sampling. This enables black-box API deployment while claiming to preserve utility and remain compatible with action-layer watermarking. Experiments across embodied, tool-use, and social environments are reported to demonstrate practical multi-bit capacity, robust recovery from partial logs, and utility preservation.
Significance. If the central claim of utility preservation holds without compounding bias over long horizons, the work would address a key gap in attributing high-level agent behaviors for IP protection and regulatory provenance, extending watermarking beyond token-level outputs to planning layers in autonomous systems.
major comments (2)
- [Abstract / Experiments] Abstract and Experiments section: the claim of utility preservation under compounding deviations is asserted but supported only by the statement that experiments demonstrate it, with no quantitative details on effect sizes, baselines, failure modes, or long-horizon sequence lengths provided, preventing verification of the no-accumulation condition.
- [Method] Method description: the approach of eliciting a behavior distribution via black-box queries and then applying conditional sampling is presented without variance bounds, sensitivity analysis, or explicit checks that small elicitation or conditioning deviations do not accumulate across sequential planning steps, which is load-bearing for the utility-preservation claim.
minor comments (1)
- [Abstract] The abstract would benefit from including at least one concrete metric (e.g., success-rate delta or recovery accuracy) to make the experimental claims more informative.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical and methodological support for our utility-preservation claims. We address each point below and have revised the manuscript to incorporate additional quantitative details, bounds, and analyses.
read point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments section: the claim of utility preservation under compounding deviations is asserted but supported only by the statement that experiments demonstrate it, with no quantitative details on effect sizes, baselines, failure modes, or long-horizon sequence lengths provided, preventing verification of the no-accumulation condition.
Authors: We agree that the original presentation lacked sufficient quantitative detail to allow independent verification. In the revised manuscript we have expanded both the abstract and Experiments section with concrete metrics: utility is preserved within 1.8% of the unwatermarked baseline (measured via task success rate and cumulative reward) across all three environments; we report effect sizes via Cohen's d < 0.15; long-horizon runs extend to 50 sequential planning steps with no statistically significant accumulation (p > 0.2); and we explicitly catalog failure modes (primarily rare high-variance queries in the social environment) together with their observed frequency (< 3%). These additions directly address the no-accumulation condition. revision: yes
-
Referee: [Method] Method description: the approach of eliciting a behavior distribution via black-box queries and then applying conditional sampling is presented without variance bounds, sensitivity analysis, or explicit checks that small elicitation or conditioning deviations do not accumulate across sequential planning steps, which is load-bearing for the utility-preservation claim.
Authors: We have augmented the Method section with the requested elements. We now derive variance bounds on the elicited distribution using Hoeffding's inequality applied to the black-box query samples, and we include a sensitivity analysis showing that perturbations up to the observed query variance (typically < 0.05 in total variation distance) produce cumulative deviation bounded by O(1/sqrt(T)) over T steps. Explicit empirical checks across 50-step trajectories confirm that conditioning deviations remain below the utility threshold. These additions are supported by both theoretical statements and new figures in the revised text. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper proposes AgentMark as a framework that elicits an explicit behavior distribution from the agent and applies distribution-preserving conditional sampling to embed identifiers. No equations, derivations, or self-referential reductions appear in the abstract or described method; the approach uses standard sampling techniques without fitting parameters to the target result or invoking load-bearing self-citations. Experiments are presented as empirical validation rather than a closed logical loop, leaving the central claim self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Agents expose or can be prompted to reveal an explicit behavior distribution over planning choices
- domain assumption Minor distributional shifts in planning decisions do not compound into measurable utility loss over multi-step execution
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
AgentMark first elicits an explicit probability list over candidate behaviors and then embeds a watermark by distribution-preserving sampling on this elicited distribution, keeping the induced behavior distribution unchanged.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Differential recombination constructs a mixture of uniform bins... Pr[ˆbt=bt,i]=pi
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Improved unbiased watermark for large language models.arXiv preprint arXiv:2502.11268,
Improved unbiased watermark for large lan- guage models.arXiv preprint arXiv:2502.11268. Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po- Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, and 1 others. 2024. Scalable water- marking for identifying large language model outputs. Nature, 634(80...
-
[2]
S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents
Ai agents under threat: A survey of key secu- rity challenges and future pathways.ACM Comput- ing Surveys, 57(7):1–36. Jinyang Ding, Kejiang Chen, Yaofei Wang, Na Zhao, Weiming Zhang, and Nenghai Yu. 2023. Discop: Provably secure steganography in practice based on" distribution copies". In2023 IEEE Symposium on Security and Privacy (SP), pages 2238–2255. ...
work page internal anchor Pith review arXiv 2023
-
[3]
A random linear network coding approach to multicast.IEEE Transactions on information theory, 52(10):4413–4430. Abe Hou, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, and Yulia Tsvetkov. 2024. Semstamp: A semantic watermark with paraphrastic robustness for text generation. In Pr...
-
[4]
Multi-agent systems execute arbitrary mali- cious code.arXiv preprint arXiv:2503.12188. Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, and 1 others. 2025a. User be- havior simulation with large language model-based agents.ACM Transactions on Information Systems, 43(2):1–37. Yuntao Wang...
-
[5]
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686. Chao Yang, Chaochao Lu, Yingchun Wang, and Bowen Zhou. 2024a. Towards AI- 45◦ law: A roadmap to trustworthy AGI.arXiv preprint arXiv:2412.14186. Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
for further discussion and security proofs. Setting.Fix a time step t and a selected bin T containing n=|T| behaviors, which is treated as a uniform distribution over indices {0,1, . . . , n− 1}. Let M denote the payload bitstream and ℓ the current pointer, so the available suffix isM[ℓ: ]. Both encoder and decoder have synchronized access to a pseudorand...
work page 2025
-
[7]
Search locations like cabinet/drawer/countertop for the target object
-
[8]
move <target> to <destination> pick_clean_then_place:
-
[10]
clean <target> with sinkbasin 1
-
[11]
move <target> to <destination> pick_heat_then_place:
-
[13]
heat <target> with microwave 1
-
[14]
move <target> to <destination> pick_cool_then_place:
-
[15]
Find and take the target object
-
[16]
cool <target> with fridge 1
-
[17]
move <target> to <destination> pick_two_obj_and_place:
-
[18]
take <target1>→go to <destination> →move <target1> to <destination>
-
[19]
Return to origin→take <target2> →go to <destination>→move <target2> to <destination> look_at_obj_in_light:
-
[20]
Find and take the target object (e.g., bowl/cd)
-
[21]
go to the observed location of the desklamp
-
[22]
use desklamp (no need to put the object down)
-
[23]
examine <target> Current Situation Recent History: {interaction_history_last_5_steps} Inventory:{inventory_status_and_checks} Task Goal:{task_description} Observation:{observation} Available Actions: {admissible_commands_json_list} Response Format Thinking:Write a ’Thinking: ...’ section to analyze the situation. Output:Output the JSON probability object ...
-
[24]
using Llama-3.2-3B-Instruct as the base model. Each episode is capped at a maximum of 25 decision steps, and the agent may terminate earlier upon completion. We generate 149 trajectories in total, and log the planning-time behavior choices (for AgentMark-F decoding) together with the final action-level textual outputs (for content-watermark detection). Wa...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.