PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks
Pith reviewed 2026-05-13 01:09 UTC · model grok-4.3
The pith
PASA embeds watermarks in LLM semantic embedding space to detect generated text after paraphrasing without distorting output.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PASA constructs a distributional dependency between token sequences and auxiliary sequences by synchronizing randomness with a secret key and semantic history inside semantic clusters of the latent embedding space. This construction is derived from a theoretical characterization of jointly optimal embedding and detection functions that balance detection accuracy, robustness to semantic-invariant changes, and zero distortion. Experiments on multiple LLMs show the resulting watermark survives strong paraphrasing attacks at higher rates than vocabulary-space baselines while leaving text quality unchanged.
What carries the argument
Semantic clusters in the latent embedding space together with shared-randomness distributional dependency synchronized by secret key and semantic history, which enables joint optimization of embedding and detection.
If this is right
- Detection accuracy stays high after semantic-preserving rewrites that defeat token-level methods.
- Generated text quality remains comparable to unwatermarked output because no token bias is introduced.
- The theoretical trade-off surface among accuracy, robustness, and distortion is achieved by the embedding-detection pair.
- Hyperparameter choices validated by ablation directly support the observed robustness without quality loss.
Where Pith is reading between the lines
- If the synchronization mechanism holds, similar semantic-level dependencies could be applied to other generative models where meaning must survive transformation.
- The approach implies that watermark verification can be performed on rewritten text without needing the original prompt or intermediate tokens.
- Success here would motivate checking whether the same cluster-and-dependency pattern reduces false positives when watermarking is combined with other detection signals.
Load-bearing premise
Semantic clusters can be formed reliably in the embedding space and the synchronized randomness produces a distributional dependency that delivers the stated optimality and robustness without creating detectable artifacts or exploitable weaknesses.
What would settle it
Running the strongest paraphrasing attack described in the paper on PASA-watermarked text and finding detection accuracy no higher than that of a standard vocabulary-space watermark, or finding statistical patterns in the output that reveal the watermark without knowledge of the secret key.
Figures
read the original abstract
Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework that characterizes a jointly optimal embedding-detection pair, achieving the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: https://ai-kunkun.github.io/PASA_page/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PASA, a watermarking algorithm for LLM-generated text that embeds and detects watermarks in a latent embedding space using semantic clusters. It constructs distributional dependencies via shared randomness synchronized by a secret key and semantic history, grounded in a theoretical framework characterizing jointly optimal embedding-detection pairs that trade off detection accuracy, robustness, and distortion. Evaluations across LLMs and semantic-invariant attacks (including strong paraphrasing) claim superior robustness and text quality compared to vocabulary-space baselines, with ablations validating hyperparameter choices.
Significance. If the theoretical optimality derivation holds and the reported robustness metrics are reproducible under the described attack strengths, PASA would represent a meaningful advance in LLM watermarking by addressing the vulnerability of prior methods to paraphrasing and other semantic-preserving transformations. The embedding-space approach and explicit focus on joint optimality are strengths that could inform future designs.
major comments (2)
- [§3.1–3.3] §3.1–3.3 (theoretical framework): the characterization of jointly optimal embedding-detection pairs relies on semantic cluster construction and randomness synchronization; the derivation should explicitly show whether optimality is parameter-free or reduces to choices of cluster granularity and history window length, as these appear among the free parameters.
- [§4.3] §4.3 (experimental results on paraphrasing): the claim of remaining robust under strong paraphrasing requires quantitative attack details (e.g., semantic similarity thresholds, paraphrase model, number of rewrites) and effect sizes with error bars; without these, the outperformance over vocabulary baselines cannot be fully assessed as load-bearing evidence.
minor comments (2)
- [Abstract] Abstract: quantitative metrics, specific LLMs tested, and attack strengths are referenced but not summarized; adding one sentence with key numbers would improve clarity.
- [§5] §5 (ablations): ensure all tested hyperparameter ranges and the exact cluster construction algorithm (including any embedding model) are listed for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate to improve clarity and completeness.
read point-by-point responses
-
Referee: [§3.1–3.3] §3.1–3.3 (theoretical framework): the characterization of jointly optimal embedding-detection pairs relies on semantic cluster construction and randomness synchronization; the derivation should explicitly show whether optimality is parameter-free or reduces to choices of cluster granularity and history window length, as these appear among the free parameters.
Authors: We appreciate the referee highlighting this aspect of the theoretical framework. The derivation of jointly optimal embedding-detection pairs is performed conditionally on a fixed semantic cluster granularity and history window length; these are treated as design hyperparameters that control the granularity of the semantic partitioning and the extent of distributional dependence. The optimality result characterizes the fundamental trade-offs for any given choice of these parameters rather than claiming parameter-free optimality. In the revised manuscript we will add an explicit statement in §3 clarifying this conditional nature and include a brief discussion of how varying cluster granularity and window length affect the achievable accuracy-robustness-distortion frontier. revision: yes
-
Referee: [§4.3] §4.3 (experimental results on paraphrasing): the claim of remaining robust under strong paraphrasing requires quantitative attack details (e.g., semantic similarity thresholds, paraphrase model, number of rewrites) and effect sizes with error bars; without these, the outperformance over vocabulary baselines cannot be fully assessed as load-bearing evidence.
Authors: We agree that additional quantitative details are required to make the robustness claims fully reproducible and to allow readers to assess the strength of the reported outperformance. The current manuscript describes the paraphrasing attacks at a high level but does not enumerate the exact paraphrase model, similarity thresholds, number of rewrites, or report error bars. In the revision we will expand §4.3 (and the experimental setup subsection) to specify the paraphrase model, the semantic similarity thresholds employed, the number of rewrites applied, and to present all detection metrics with error bars computed across multiple independent runs. These additions will enable direct evaluation of the evidence. revision: yes
Circularity Check
No significant circularity; theoretical framework presented as independent grounding
full rationale
The abstract grounds the PASA design in a theoretical framework characterizing jointly optimal embedding-detection pairs and trade-offs among accuracy, robustness, and distortion. No equations or self-citations are supplied in the given material that would reduce this framework to a redefinition of the algorithm's own cluster-construction or randomness parameters. Evaluations on multiple LLMs and attacks are described as external validation, with no indication that predictions reduce by construction to fitted inputs or prior self-citations. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- semantic cluster construction parameters
- randomness synchronization threshold or history window
axioms (2)
- domain assumption Semantic clusters in the latent embedding space exist and can be reliably identified across paraphrases
- domain assumption A jointly optimal embedding-detection pair exists and is characterized by the theoretical framework
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.