arxiv: 2602.14286 · v4 · submitted 2026-02-15 · 📊 stat.ME · stat.ML

Online LLM watermark detection via e-processes

Weijie Su , Ruodu Wang , Zinan Zhao

Pith reviewed 2026-05-15 21:46 UTC · model grok-4.3

classification 📊 stat.ME stat.ML

keywords LLM watermark detection · e-processes · online testing · independence testing · adaptive e-processes · anytime-valid inference · statistical power analysis

The pith

E-processes provide a unified framework for online detection of watermarks in LLM-generated text with anytime-valid guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a statistical framework for detecting watermarks in large language model outputs using e-processes. This approach converts the detection problem into testing for independence between tokens and a pseudo-random sequence. It offers methods to build adaptive e-processes that improve detection power while maintaining validity at any time during sequential testing. The framework applies to any setting with independent pivotal statistics and includes theoretical analysis of its power. Experiments show it performs competitively with existing methods.

Core claim

We develop a unified framework for LLM watermark detection based on e-processes, providing anytime-valid guarantees for online testing. We propose various methods to construct empirically adaptive e-processes that can enhance the detection power. The proposed methods are applicable to any sequential testing problem where independent pivotal statistics are available. Theoretical results characterize the power properties of the proposed procedures.

What carries the argument

E-processes: nonnegative processes whose expected value is at most 1 at every stopping time under the null of independence (nonnegative supermartingales started at 1 are the canonical case), used to construct anytime-valid sequential tests for watermark detection.
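As a rough illustration of the object described here, the following minimal sketch builds an e-process as a running product of per-token e-values. It assumes, hypothetically, that each pivotal statistic is uniform on (0, 1) under the null and tends to be larger under a watermark. The betting function `1 + lam * (2y - 1)`, the names `e_process` and `first_rejection`, and the default `lam` are illustrative choices, not the paper's construction.

```python
def e_process(stats, lam=0.5):
    """Running product of e-values e(y) = 1 + lam * (2y - 1).

    If y is uniform on (0, 1) under the null, each factor has mean 1 and is
    nonnegative whenever 0 <= lam <= 1, so the product is a test martingale.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    wealth, path = 1.0, []
    for y in stats:
        wealth *= 1.0 + lam * (2.0 * y - 1.0)
        path.append(wealth)
    return path


def first_rejection(path, alpha=0.05):
    """Return the first (1-indexed) time the e-process reaches 1/alpha, or None."""
    for t, wealth in enumerate(path, start=1):
        if wealth >= 1.0 / alpha:
            return t
    return None
```

Because rejection is triggered the first time the wealth reaches `1 / alpha`, the monitor can stop at any token without fixing the sample size in advance.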

Load-bearing premise

Watermark schemes reliably induce dependence between generated tokens and a pseudo-random sequence, allowing reduction to an independence testing problem with available independent pivotal statistics.

What would settle it

An experiment showing that the constructed e-process fails to reject the null on a known watermarked text stream, or that its false-alarm guarantee breaks down when the pivotal statistics violate independence.
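A check of this kind could be sketched as a small Monte Carlo simulation. The version below only covers the null side: it draws independent uniform statistics (an assumed null model) and counts how often a simple product e-process ever reaches `1 / alpha`. The function names, the betting function, and all defaults are hypothetical.

```python
import random


def crosses_threshold(stats, lam=0.5, alpha=0.05):
    """True if the product of e-values 1 + lam * (2y - 1) ever reaches 1/alpha."""
    wealth = 1.0
    for y in stats:
        wealth *= 1.0 + lam * (2.0 * y - 1.0)
        if wealth >= 1.0 / alpha:
            return True
    return False


def false_alarm_rate(n_streams=1000, n_tokens=200, alpha=0.05, seed=0):
    """Fraction of simulated null streams on which the test ever rejects."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_streams):
        stream = [rng.random() for _ in range(n_tokens)]
        if crosses_threshold(stream, alpha=alpha):
            hits += 1
    return hits / n_streams
```

Replacing the uniform draws with dependent or non-uniform statistics would probe the second half of the criterion, namely how the guarantee degrades when the independence assumption fails.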

Original abstract

Watermarking for large language models (LLMs) has emerged as an effective tool for distinguishing AI-generated text from human-written content. Statistically, watermark schemes induce dependence between generated tokens and a pseudo-random sequence, reducing watermark detection to a hypothesis testing problem on independence. We develop a unified framework for LLM watermark detection based on e-processes, providing anytime-valid guarantees for online testing. We propose various methods to construct empirically adaptive e-processes that can enhance the detection power. The proposed methods are applicable to any sequential testing problem where independent pivotal statistics are available. In addition, theoretical results are established to characterize the power properties of the proposed procedures. Some experiments demonstrate that the proposed framework achieves competitive performance compared to existing watermark detection methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a unified e-process framework for online LLM watermark detection with adaptive constructions that keep anytime-valid guarantees, but it is mostly an incremental application rather than a foundational advance.

The punchline is that the authors reduce watermark detection to sequential independence testing and then build e-processes with data-driven adaptations to raise power while preserving anytime-valid error control. This setup applies to any problem with independent pivotal statistics, and they supply power characterizations plus experiments that look competitive with prior detectors. The adaptive constructions are the main technical step beyond standard e-process use, and the anytime-valid property is genuinely useful for platforms that need to flag AI text in real time without fixed sample sizes. The reduction itself is standard for common watermark schemes, so the novelty sits in the unified online framing and the empirical adaptations rather than new theory. I see no circularity or load-bearing assumptions that contradict the setup; the work builds cleanly on existing e-process results. Soft spots are modest: the abstract leaves the exact form of the adaptive e-processes and the experimental protocols a bit thin, so the claimed power gains need the full derivations and data details to be fully convincing. The assumption that watermarks reliably induce the required dependence holds for the schemes they target but could weaken if future schemes change. Overall the math looks solid on its own terms and the experiments are presented as supportive rather than decisive. This is for readers working on sequential testing or practical AI-content detection; it is not reshaping core statistics but supplies a usable tool. It deserves a serious referee to check the adaptive constructions and power bounds.

Referee Report

2 major / 2 minor

Summary. The paper develops a unified framework for LLM watermark detection based on e-processes, providing anytime-valid guarantees for online sequential testing. It reduces watermark detection to an independence testing problem under the assumption that watermark schemes induce dependence between tokens and a pseudo-random sequence, proposes methods for constructing empirically adaptive e-processes to improve detection power, establishes theoretical results on power properties, and demonstrates competitive experimental performance against existing methods. The framework is positioned as applicable to any sequential testing setting with independent pivotal statistics.

Significance. If the central reduction to independence testing holds and the adaptive e-process constructions deliver the claimed power gains while preserving anytime-validity, the work would provide a statistically rigorous online monitoring tool for AI-generated text, extending e-process theory to a timely application area. The emphasis on broad applicability and theoretical power characterizations strengthens its potential impact beyond LLM watermarking.

major comments (2)

[§2 (Problem Setup and Reduction)] The central claim relies on the availability of independent pivotal statistics for e-process construction after the reduction to independence testing. The manuscript should explicitly verify this condition for standard watermark schemes (e.g., those inducing token-sequence dependence) with a concrete example or lemma showing that the resulting statistics remain independent and pivotal under the null.
[§3 (Theoretical Results)] Theoretical power results are stated to characterize the procedures, but the growth rate of the adaptive e-processes under alternatives (relative to non-adaptive baselines) needs an explicit bound or comparison theorem to substantiate the claimed enhancement; without it, the adaptivity benefit remains qualitative.
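To make the second major comment concrete, here is one standard way such a growth rate is usually written. It rests on a simplifying assumption that is not taken from the paper: the pivotal statistics Y_1, Y_2, ... are i.i.d. under the alternative, and the e-process E_t is a product of per-token e-values e_λ(Y_s) with a fixed tuning parameter λ and finite expected log-value. The symbols E_t, e_λ, g(λ), and τ_α are notation introduced here for illustration.

```latex
\[
\frac{1}{t}\log E_t
  \;=\; \frac{1}{t}\sum_{s=1}^{t}\log e_\lambda(Y_s)
  \;\xrightarrow{\ \text{a.s.}\ }\;
  g(\lambda) := \mathbb{E}_{1}\!\left[\log e_\lambda(Y)\right],
\qquad
\tau_\alpha := \inf\{t : E_t \ge 1/\alpha\}
  \;\approx\; \frac{\log(1/\alpha)}{g(\lambda)}
  \quad \text{when } g(\lambda) > 0 .
\]
```

Under these assumptions, a comparison theorem of the kind requested would bound the gap between the growth rate achieved by the adaptive procedure and the best fixed-parameter rate, sup over λ of g(λ).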

minor comments (2)

[§4 (Experiments)] In the experiments, report quantitative metrics such as empirical power at fixed type-I error levels and the number of tokens required for detection across multiple watermark schemes and text lengths to allow direct comparison with baselines.
[§3 (Methods)] Clarify the precise definition of 'empirically adaptive' e-processes (e.g., how the adaptation is performed without using future data) in the methods section to avoid ambiguity for readers unfamiliar with e-process literature.
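The point about not using future data can be illustrated with a predictable plug-in rule: the bet size at step t is computed from observations before t only. The sketch below is an assumption-laden illustration, not the paper's method. It assumes uniform pivotal statistics under the null, and the name `adaptive_e_process`, the ratio-based bet size, and the cap `lam_max` are all hypothetical.

```python
def adaptive_e_process(stats, lam_max=0.9):
    """Product e-process whose bet size at each step uses past data only."""
    if not 0.0 <= lam_max < 1.0:
        raise ValueError("lam_max must lie in [0, 1)")
    wealth, path = 1.0, []
    sum_g, sum_g2 = 0.0, 0.0  # running sums of past payoffs g = 2y - 1
    for y in stats:
        # Choose lam BEFORE looking at y, so it is predictable and the
        # conditional mean of the next factor stays 1 under the null.
        lam = 0.0
        if sum_g2 > 0.0:
            lam = min(lam_max, max(0.0, sum_g / sum_g2))
        g = 2.0 * y - 1.0
        wealth *= 1.0 + lam * g
        path.append(wealth)
        # Only now fold the current observation into the running sums.
        sum_g += g
        sum_g2 += g * g
    return path
```

The ratio `sum_g / sum_g2` is a crude approximation to a growth-optimal bet; the first step always bets zero because no past data exist yet.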

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation and insightful comments. We address each major comment below and will incorporate the suggested changes in the revised manuscript.

Referee: [§2 (Problem Setup and Reduction)] The central claim relies on the availability of independent pivotal statistics for e-process construction after the reduction to independence testing. The manuscript should explicitly verify this condition for standard watermark schemes (e.g., those inducing token-sequence dependence) with a concrete example or lemma showing that the resulting statistics remain independent and pivotal under the null.

Authors: We appreciate this suggestion. Upon review, we note that the reduction to independence testing is standard for watermark schemes that introduce dependence between tokens and the pseudo-random sequence. In the revised version, we will add a lemma in §2 with a concrete example for the Kirchenbauer watermark scheme, proving that the resulting test statistics are independent and pivotal under the null hypothesis of no watermark. (Revision: yes)
Referee: [§3 (Theoretical Results)] Theoretical power results are stated to characterize the procedures, but the growth rate of the adaptive e-processes under alternatives (relative to non-adaptive baselines) needs an explicit bound or comparison theorem to substantiate the claimed enhancement; without it, the adaptivity benefit remains qualitative.

Authors: We agree that making the power enhancement explicit would be beneficial. We will include a new theorem in §3 that provides an explicit lower bound on the growth rate of the log of the adaptive e-process under alternatives, demonstrating a strict improvement over non-adaptive counterparts in terms of asymptotic power. (Revision: yes)
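For the Kirchenbauer-style example mentioned in the first response, a hedged sketch of what a pivotal statistic and e-value could look like: in a green/red-list scheme with green fraction `gamma`, the indicator that a token is green is commonly modeled as Bernoulli(`gamma`) under the null. The sketch assumes those indicators are independent, which is the very condition the referee asks the authors to verify. The alternative green rate `q`, the function names, and the defaults are hypothetical.

```python
def green_e_value(is_green, gamma=0.5, q=0.7):
    """Likelihood-ratio e-value for one green/red indicator.

    Under the null the indicator is Bernoulli(gamma), so the expected value is
    gamma * (q / gamma) + (1 - gamma) * ((1 - q) / (1 - gamma)) = 1.
    """
    if not (0.0 < gamma < 1.0 and 0.0 < q < 1.0):
        raise ValueError("gamma and q must lie strictly between 0 and 1")
    return q / gamma if is_green else (1.0 - q) / (1.0 - gamma)


def green_e_process(indicators, gamma=0.5, q=0.7):
    """Running product of per-token likelihood-ratio e-values."""
    wealth, path = 1.0, []
    for is_green in indicators:
        wealth *= green_e_value(is_green, gamma, q)
        path.append(wealth)
    return path
```

The statistic is pivotal in the sense that its null distribution depends only on `gamma`, not on the language model or the text.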

Circularity Check

0 steps flagged

No significant circularity; framework extends established e-process theory

The derivation chain reduces watermark detection to a standard independence testing problem under the explicit applicability condition that watermark schemes induce token-sequence dependence and independent pivotal statistics exist. This reduction is stated as a precondition rather than derived internally. The unified e-process framework and adaptive constructions are proposed as extensions of prior e-process literature, with power properties characterized theoretically in a separate step. No equations, fitted parameters, or self-citations are shown to force the central guarantees by construction; the adaptive methods enhance power within the given setting without redefining inputs as outputs. This is the most common honest non-finding for papers that apply established sequential testing tools to a new domain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard properties of e-processes for sequential hypothesis testing and the modeling assumption that watermarking produces independent pivotal statistics; no free parameters or invented entities are introduced in the abstract description.

axioms (2)

standard math: E-processes provide anytime-valid p-values for sequential testing under standard martingale conditions.
Invoked when reducing watermark detection to online independence testing with anytime-valid guarantees.
domain assumption: Watermark schemes induce dependence between tokens and a pseudo-random sequence, yielding pivotal statistics.
Stated in the abstract as the basis for reducing detection to hypothesis testing on independence.
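The first axiom is typically justified by Ville's inequality. For an e-process (E_t) under a null distribution P_0 and any level α in (0, 1), the bound below holds, and the running reciprocal of the maximum gives an anytime-valid p-value. The notation E_t, P_0, and p_t is introduced here for illustration.

```latex
\[
\mathbb{P}_{0}\!\left(\exists\, t \ge 1 : E_t \ge \tfrac{1}{\alpha}\right) \le \alpha,
\qquad
p_t := \min\!\Big\{1,\ \big(\max_{s \le t} E_s\big)^{-1}\Big\}.
\]
```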


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: reducing watermark detection to a hypothesis testing problem on independence... pivotal statistic Y=U_W... super-uniform under the alternative

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems
stat.ML · 2026-05 · unverdicted · novelty 6.0

A wrapper for black-box generate-verify AI pipelines that uses a conservative hard-negative reference pool and e-processes to control the probability of releasing on infeasible tasks while permitting release on feasible ones.