PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

Haiyun He; Zhenxin Ai

arxiv: 2605.10977 · v2 · pith:DFFDPXFGnew · submitted 2026-05-09 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-13 01:09 UTC · model grok-4.3

keywords LLM watermarking · semantic embedding space · paraphrasing attacks · text detection · robust watermarking · distributional dependency · distortion-free · semantic-invariant attacks

The pith

PASA embeds watermarks in LLM semantic embedding space to detect generated text after paraphrasing without distorting output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that watermarking LLM output at the semantic level, rather than the token level, can survive attacks that rewrite text while preserving meaning. It does so by grouping tokens into semantic clusters in latent space and tying their statistics to an auxiliary sequence through randomness shared via a secret key and the semantic history of the generation. A sympathetic reader would care because current watermark detectors lose reliability once text is paraphrased, yet responsible use of LLMs requires dependable detection without forcing writers to accept lower-quality output. If the method works, detection becomes possible even on heavily rewritten passages while the watermarked output remains statistically indistinguishable from unwatermarked model output.

Core claim

PASA constructs a distributional dependency between token sequences and auxiliary sequences by synchronizing randomness with a secret key and semantic history inside semantic clusters of the latent embedding space. This construction is derived from a theoretical characterization of jointly optimal embedding and detection functions that balance detection accuracy, robustness to semantic-invariant changes, and zero distortion. Experiments on multiple LLMs show the resulting watermark survives strong paraphrasing attacks at higher rates than vocabulary-space baselines while leaving text quality unchanged.
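To make the synchronization idea concrete, here is a minimal sketch of how a generator and a detector could derive the same randomness from a secret key and a short semantic history. This is an illustration, not the paper's construction: the function name `synchronized_rng`, the HMAC-SHA256 seeding, and the default window of 4 cluster IDs are all assumptions introduced here.

```python
import hashlib
import hmac
import random


def synchronized_rng(secret_key: bytes, cluster_history: list[int], window: int = 4) -> random.Random:
    """Return an RNG seeded by the key and the last `window` cluster IDs.

    Hypothetical helper: the paper does not specify this seeding scheme.
    """
    recent = cluster_history[-window:]
    message = ",".join(str(c) for c in recent).encode("utf-8")
    # Keyed hash: without the key, the seed cannot be recomputed.
    digest = hmac.new(secret_key, message, hashlib.sha256).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed)


key = b"demo-secret-key"
# Two histories that differ only outside the window yield the same randomness.
gen_side = synchronized_rng(key, [7, 2, 2, 9, 4])
det_side = synchronized_rng(key, [1, 2, 2, 9, 4])
same_draw = gen_side.random() == det_side.random()

# A different key yields a different seed, hence different draws.
other_key_draw = synchronized_rng(b"other-key", [7, 2, 2, 9, 4]).random()
```

Because the seed depends on cluster IDs rather than on exact tokens, a rewrite that swaps a token for another one in the same cluster would leave the seed unchanged in this sketch. That is the intuition behind anchoring randomness to semantic history.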

What carries the argument

Semantic clusters in the latent embedding space together with shared-randomness distributional dependency synchronized by secret key and semantic history, which enables joint optimization of embedding and detection.
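The step from a token-level distribution to a cluster-level one can be sketched as a simple sum over cluster members. The toy vocabulary, the probabilities, and the `cluster_of` mapping below are invented for illustration. In the paper the mapping comes from partitioning the latent embedding space, which this sketch does not attempt.

```python
from collections import defaultdict


def cluster_distribution(token_probs: dict[str, float], cluster_of: dict[str, int]) -> dict[int, float]:
    """Sum next-token probabilities within each semantic cluster."""
    totals: dict[int, float] = defaultdict(float)
    for token, prob in token_probs.items():
        totals[cluster_of[token]] += prob
    return dict(totals)


# Toy next-token distribution (assumed values, not from the paper).
token_probs = {"big": 0.4, "large": 0.3, "small": 0.2, "tiny": 0.1}
# Hypothetical semantic mapping: near-synonyms share a cluster ID.
cluster_of = {"big": 0, "large": 0, "small": 1, "tiny": 1}

q_cluster = cluster_distribution(token_probs, cluster_of)
```

Replacing "big" with "large" changes the token but not the cluster, so any statistic computed on cluster IDs is unaffected by that substitution.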

If this is right

Detection accuracy stays high after semantic-preserving rewrites that defeat token-level methods.
Generated text quality remains comparable to unwatermarked output because no token bias is introduced.
The theoretical trade-off surface among accuracy, robustness, and distortion is achieved by the embedding-detection pair.
Hyperparameter choices validated by ablation directly support the observed robustness without quality loss.
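The "no token bias" point can be written as a condition on the sampling rule. The formulation below is the one commonly used for distortion-free (unbiased) watermarks in the literature; whether the paper states it in exactly this form is an assumption. Here $Q_t$ is the model's next-token distribution, $\mathcal{V}$ the vocabulary, and $\zeta$ the shared randomness derived from the secret key.

```latex
% Distortion-free condition: averaging over the shared randomness \zeta
% recovers the model's original next-token distribution.
\mathbb{E}_{\zeta}\!\left[ \Pr\!\left( X_t = x \mid x_{<t}, \zeta \right) \right] \;=\; Q_t(x)
\qquad \text{for all } x \in \mathcal{V}.
```

Under this condition, an observer without the key sees samples distributed exactly as unwatermarked output, which is why quality metrics are expected to be unchanged.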

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the synchronization mechanism holds, similar semantic-level dependencies could be applied to other generative models where meaning must survive transformation.
The approach implies that watermark verification can be performed on rewritten text without needing the original prompt or intermediate tokens.
Success here would motivate checking whether the same cluster-and-dependency pattern reduces false positives when watermarking is combined with other detection signals.

Load-bearing premise

Semantic clusters can be formed reliably in the embedding space and the synchronized randomness produces a distributional dependency that delivers the stated optimality and robustness without creating detectable artifacts or exploitable weaknesses.

What would settle it

Running the strongest paraphrasing attack described in the paper on PASA-watermarked text and finding detection accuracy no higher than that of a standard vocabulary-space watermark, or finding statistical patterns in the output that reveal the watermark without knowledge of the secret key.

Figures

Figures reproduced from arXiv: 2605.10977 by Haiyun He, Zhenxin Ai.

**Figure 1.** Left: Illustration of PASA, a principled watermarking approach operating in the latent embedding space on semantic clusters. By anchoring shared randomness to semantic clusters via a secret key, PASA remains robust against semantic-invariant attacks (e.g., paraphrasing) while ensuring distortion-free generation. Right: Quantitative results demonstrating that PASA outperforms standard vocabulary-space water…

**Figure 2.** Overview of PASA. Left: Construction of the semantic mapping function $f$, which partitions the latent token embedding space into $K$ semantic clusters. Right: Top (Generation). (G1) At each step $t$, the NTP distribution $Q_t$ is transformed into the cluster distribution $Q_t^f$. (G2) The auxiliary distribution $P_{\zeta_t}$ is truncated by a threshold $\alpha$ and contains an overflow state $\tilde{\zeta}$ to ensure FA error control. (G3) Auxi…

**Figure 3.** Ablation study on hyper-parameters. (a) Impact of semantic cluster granularity ($K$) on robustness across log-scale cluster counts. (b) Impact of synchronization window size ($w$) on robustness. The plots compare the baseline (Original) against T5-based token replacement attacks ($r = 0.3, 0.5$).

**Figure 4.** Detection performance across various generated text lengths. The ROC-AUC and True Positive Rate (TPR) exhibit rapid convergence, achieving near-perfect detection beyond 300 tokens.
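For readers less familiar with the two metrics named in the Figure 4 caption, the sketch below computes ROC-AUC and TPR at a fixed false-positive rate from detector scores. The scores are invented placeholders, not results from the paper, and the function names are introduced here for illustration.

```python
def tpr_at_fpr(neg_scores, pos_scores, target_fpr=0.01):
    """TPR when the threshold lets at most `target_fpr` of negatives through.

    A text is flagged as watermarked when its score is strictly above the threshold.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    allowed = int(target_fpr * len(neg_sorted))  # false alarms we may tolerate
    threshold = neg_sorted[allowed] if allowed < len(neg_sorted) else float("-inf")
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    return tpr, threshold


def roc_auc(neg_scores, pos_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


# Placeholder detector scores (assumed): unwatermarked vs. watermarked texts.
unwatermarked = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
watermarked = [0.95, 1.5, 2.0, 0.85]

tpr, threshold = tpr_at_fpr(unwatermarked, watermarked, target_fpr=0.1)
auc = roc_auc(unwatermarked, watermarked)
```

Reporting TPR at a low, fixed FPR matters for watermark detection because false accusations are costly; AUC alone can look strong while the low-FPR regime is weak.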

Original abstract

Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework that characterizes a jointly optimal embedding-detection pair, achieving the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: https://ai-kunkun.github.io/PASA_page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PASA moves watermarking into semantic embedding clusters with shared-randomness dependencies to target paraphrasing robustness, but the abstract supplies no numbers or derivations to check the optimality claims.

PASA's main contribution is shifting the watermark from vocabulary space into semantic clusters in the latent embedding space, then building a distributional link between the token sequence and an auxiliary sequence through randomness that is synchronized by a secret key plus semantic history. The authors frame this as coming from a theoretical characterization of jointly optimal embedding-detection pairs that respect the accuracy-robustness-distortion trade-off. That framing is a step beyond the usual heuristic tweaks in the literature, and the idea of operating at the semantic level makes sense as a response to paraphrasing attacks that leave meaning intact but scramble token-level signals. If the full paper genuinely derives the optimality conditions rather than retrofitting them to chosen cluster sizes and randomness parameters, it could give the field a cleaner way to reason about these methods.

The abstract also notes evaluations across LLMs, multiple semantic-invariant attacks, and hyperparameter ablations, which at least shows the authors tested the practical side rather than stopping at the theory sketch. The direction is timely for anyone working on detection and accountability for generated text.

That said, the summary contains zero quantitative results (no detection rates, no distortion scores, no attack-strength details, no error bars), so it is impossible to tell whether the robustness gain is substantial or modest. The reliability of the semantic clusters themselves is left unexamined in the provided material; if embeddings do not form stable, model-agnostic clusters, the whole construction could degrade or introduce new artifacts. The shared-randomness synchronization also needs explicit checks that it does not leak information or create detectable statistical patterns under the very attacks it claims to resist.

This paper is for researchers already following watermarking and semantic-attack work. A reader looking for a new conceptual handle on the robustness problem will find the angle worth examining, even if the current write-up leaves the claims under-supported. It is solid enough in its novelty and problem framing to merit peer review rather than a desk reject, provided the authors supply the missing metrics, derivations, and implementation specifics in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes PASA, a watermarking algorithm for LLM-generated text that embeds and detects watermarks in a latent embedding space using semantic clusters. It constructs distributional dependencies via shared randomness synchronized by a secret key and semantic history, grounded in a theoretical framework characterizing jointly optimal embedding-detection pairs that trade off detection accuracy, robustness, and distortion. Evaluations across LLMs and semantic-invariant attacks (including strong paraphrasing) claim superior robustness and text quality compared to vocabulary-space baselines, with ablations validating hyperparameter choices.

Significance. If the theoretical optimality derivation holds and the reported robustness metrics are reproducible under the described attack strengths, PASA would represent a meaningful advance in LLM watermarking by addressing the vulnerability of prior methods to paraphrasing and other semantic-preserving transformations. The embedding-space approach and explicit focus on joint optimality are strengths that could inform future designs.

major comments (2)

[§3.1–3.3] (theoretical framework): the characterization of jointly optimal embedding-detection pairs relies on semantic cluster construction and randomness synchronization; the derivation should explicitly show whether optimality is parameter-free or reduces to choices of cluster granularity and history window length, as these appear among the free parameters.
[§4.3] (experimental results on paraphrasing): the claim of remaining robust under strong paraphrasing requires quantitative attack details (e.g., semantic similarity thresholds, paraphrase model, number of rewrites) and effect sizes with error bars; without these, the outperformance over vocabulary baselines cannot be fully assessed as load-bearing evidence.

minor comments (2)

[Abstract] quantitative metrics, specific LLMs tested, and attack strengths are referenced but not summarized; adding one sentence with key numbers would improve clarity.
[§5] (ablations): ensure all tested hyperparameter ranges and the exact cluster construction algorithm (including any embedding model) are listed for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate to improve clarity and completeness.

Referee: [§3.1–3.3] (theoretical framework): the characterization of jointly optimal embedding-detection pairs relies on semantic cluster construction and randomness synchronization; the derivation should explicitly show whether optimality is parameter-free or reduces to choices of cluster granularity and history window length, as these appear among the free parameters.

Authors: We appreciate the referee highlighting this aspect of the theoretical framework. The derivation of jointly optimal embedding-detection pairs is performed conditionally on a fixed semantic cluster granularity and history window length; these are treated as design hyperparameters that control the granularity of the semantic partitioning and the extent of distributional dependence. The optimality result characterizes the fundamental trade-offs for any given choice of these parameters rather than claiming parameter-free optimality. In the revised manuscript we will add an explicit statement in §3 clarifying this conditional nature and include a brief discussion of how varying cluster granularity and window length affect the achievable accuracy-robustness-distortion frontier.
Revision: yes

Referee: [§4.3] (experimental results on paraphrasing): the claim of remaining robust under strong paraphrasing requires quantitative attack details (e.g., semantic similarity thresholds, paraphrase model, number of rewrites) and effect sizes with error bars; without these, the outperformance over vocabulary baselines cannot be fully assessed as load-bearing evidence.

Authors: We agree that additional quantitative details are required to make the robustness claims fully reproducible and to allow readers to assess the strength of the reported outperformance. The current manuscript describes the paraphrasing attacks at a high level but does not enumerate the exact paraphrase model, similarity thresholds, number of rewrites, or report error bars. In the revision we will expand §4.3 (and the experimental setup subsection) to specify the paraphrase model, the semantic similarity thresholds employed, the number of rewrites applied, and to present all detection metrics with error bars computed across multiple independent runs. These additions will enable direct evaluation of the evidence.
Revision: yes
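As an illustration of what "error bars computed across multiple independent runs" could look like in practice, the sketch below reports a mean with its standard error. The per-run values are placeholders made up for this example and are not results from the paper; the helper name is likewise an assumption.

```python
import statistics


def mean_with_stderr(values):
    """Return the sample mean and the standard error of the mean."""
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(values) / len(values) ** 0.5
    return mean, stderr


# Placeholder TPR values from five hypothetical independent runs.
placeholder_tpr_per_run = [0.90, 0.92, 0.88, 0.91, 0.89]

mean_tpr, stderr_tpr = mean_with_stderr(placeholder_tpr_per_run)
summary = f"TPR = {mean_tpr:.3f} ± {stderr_tpr:.3f} (n={len(placeholder_tpr_per_run)})"
```

Stating the number of runs next to the interval lets a reader judge whether a gap between two methods exceeds run-to-run variation.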

Circularity Check

0 steps flagged

No significant circularity; theoretical framework presented as independent grounding

The abstract grounds the PASA design in a theoretical framework characterizing jointly optimal embedding-detection pairs and trade-offs among accuracy, robustness, and distortion. No equations or self-citations are supplied in the given material that would reduce this framework to a redefinition of the algorithm's own cluster-construction or randomness parameters. Evaluations on multiple LLMs and attacks are described as external validation, with no indication that predictions reduce by construction to fitted inputs or prior self-citations. On the material available, the derivation chain therefore shows no sign of circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Abstract-only view yields limited visibility into parameters; the method relies on semantic embedding clusters and synchronized randomness whose precise definitions and any fitting procedures are not detailed.

free parameters (2)

semantic cluster construction parameters
Hyperparameters that define clusters in the embedding space are required for the method but not quantified in the abstract.
randomness synchronization threshold or history window
Parameters controlling how semantic history synchronizes the shared randomness are implicit in the design.

axioms (2)

domain assumption: Semantic clusters in the latent embedding space exist and can be reliably identified across paraphrases
The core operation of PASA presupposes stable semantic clustering that survives semantic-invariant attacks.
domain assumption: A jointly optimal embedding-detection pair exists and is characterized by the theoretical framework
The paper states the design is grounded in this framework without providing the derivation in the abstract.
