pith. machine review for the scientific record.

arxiv: 2602.14286 · v4 · submitted 2026-02-15 · 📊 stat.ME · stat.ML

Online LLM watermark detection via e-processes

Pith reviewed 2026-05-15 21:46 UTC · model grok-4.3

classification 📊 stat.ME stat.ML
keywords LLM watermark detection · e-processes · online testing · independence testing · adaptive e-processes · anytime-valid inference · statistical power analysis

The pith

E-processes provide a unified framework for online detection of watermarks in LLM-generated text with anytime-valid guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a statistical framework for detecting watermarks in large language model outputs using e-processes. This approach converts the detection problem into testing for independence between tokens and a pseudo-random sequence. It offers methods to build adaptive e-processes that improve detection power while maintaining validity at any time during sequential testing. The framework applies to any setting with independent pivotal statistics and includes theoretical analysis of its power. Experiments show it performs competitively with existing methods.

Core claim

We develop a unified framework for LLM watermark detection based on e-processes, providing anytime-valid guarantees for online testing. We propose various methods to construct empirically adaptive e-processes that can enhance the detection power. The proposed methods are applicable to any sequential testing problem where independent pivotal statistics are available. Theoretical results characterize the power properties of the proposed procedures.
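The idea of an empirically adaptive e-process can be illustrated with a plug-in betting rule. The sketch below is not the paper's construction: it assumes pivotal statistics that are Uniform(0, 1) under the null and a hypothetical linear alternative density 1 + theta * (2u - 1), for which 3 times the running mean of (2u - 1) is a method-of-moments estimate of theta. The names `adaptive_e_process`, `lam_max`, and `scale` are illustrative. Validity hinges on the bet at step t using only pivots observed before step t.

```python
def adaptive_e_process(pivots, lam_max=0.9, scale=3.0):
    """Wealth path of a betting e-process with a predictable plug-in bet.

    Assumes each pivot lies in [0, 1] and is Uniform(0, 1) under the null,
    so every factor 1 + lam * (2u - 1) has conditional mean 1.
    """
    wealth, path = 1.0, []
    total, n = 0.0, 0  # running sum and count of centered PAST pivots
    for u in pivots:
        # The bet is computed before u is used, so it depends on the past only.
        mean_past = total / n if n else 0.0
        lam = min(lam_max, max(0.0, scale * mean_past))
        wealth *= 1.0 + lam * (2.0 * u - 1.0)
        path.append(wealth)
        # Only now is the current pivot folded into the estimate.
        total += 2.0 * u - 1.0
        n += 1
    return path
```

Clipping the bet to [0, lam_max] keeps every factor strictly positive and makes the test one-sided (it only bets on pivots being larger than uniform); other predictable update rules would fit the same template.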

What carries the argument

E-processes: nonnegative processes whose expected value at any stopping time is at most 1 under the null of independence (nonnegative supermartingales started at 1 are the canonical example), used to construct anytime-valid sequential tests for watermark detection.
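As a minimal illustration of how such a process yields a sequential test, the sketch below multiplies per-token e-values and stops the first time the product reaches 1/alpha, which controls the false-alarm probability at level alpha by Ville's inequality. It assumes pivotal statistics in [0, 1] that are independent Uniform(0, 1) under the null; the fixed bet `lam` and the function names are illustrative and not taken from the paper.

```python
def e_process(pivots, lam=0.5):
    """Running product of e-values 1 + lam * (2u - 1) for pivots u in [0, 1].

    Under the assumed null (u ~ Uniform(0, 1), independent), each factor
    is nonnegative with mean 1, so the product is a test martingale.
    """
    if not -1.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [-1, 1] so every factor is nonnegative")
    wealth, path = 1.0, []
    for u in pivots:
        wealth *= 1.0 + lam * (2.0 * u - 1.0)
        path.append(wealth)
    return path


def first_rejection(path, alpha=0.05):
    """1-based index of the first time wealth reaches 1/alpha, else None."""
    for t, w in enumerate(path, start=1):
        if w >= 1.0 / alpha:
            return t
    return None
```

Because the threshold is valid at every time simultaneously, the monitor may stop at the first crossing, or at any other data-dependent time, without inflating the type-I error.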

Load-bearing premise

Watermark schemes reliably induce dependence between generated tokens and a pseudo-random sequence, allowing reduction to an independence testing problem with available independent pivotal statistics.
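To make the premise concrete, here is a hedged sketch for a green-list style watermark. Everything in it is an assumption for illustration: the hash-based `is_green` rule, the secret `key`, the green fraction `gamma`, and the alternative green rate `q` are hypothetical and do not reproduce any specific published scheme. For text written without access to the key, each green indicator is treated as Bernoulli(gamma), so the likelihood-ratio e-value below has mean 1 under the null.

```python
import hashlib


def is_green(prev_token, token, key, gamma=0.5):
    """Hypothetical keyed rule: hash (key, prev_token, token) to [0, 1)."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    u = int.from_bytes(digest[:6], "big") / 2**48  # 48 bits, exact as a float
    return u < gamma


def bernoulli_e_value(green, gamma=0.5, q=0.7):
    """Likelihood ratio Bernoulli(q) / Bernoulli(gamma) for one indicator."""
    return q / gamma if green else (1.0 - q) / (1.0 - gamma)


def green_e_process(tokens, key, gamma=0.5, q=0.7):
    """Wealth path over consecutive token pairs of a text."""
    wealth, path = 1.0, []
    for prev, tok in zip(tokens, tokens[1:]):
        wealth *= bernoulli_e_value(is_green(prev, tok, key, gamma), gamma, q)
        path.append(wealth)
    return path
```

One caveat under these assumptions: a repeated (prev_token, token) pair reuses the same indicator, which breaks independence, so a careful detector would need to handle repeated contexts separately.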

What would settle it

An experiment in which the constructed e-process fails to reject the null on a known watermarked text stream, or loses type-I error control (yielding invalid anytime-valid p-values) when the pivotal statistics violate independence.
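A validity experiment of this kind can be prototyped on synthetic null data. The sketch below, which assumes independent Uniform(0, 1) pivots under the null, compares two monitors that are checked after every token: a naive one-sided z-test that is simply re-run at each step, and a fixed-bet e-process with threshold 1/alpha. All function names and default parameters are illustrative.

```python
import math
import random


def ever_rejects_naive(pivots, z_crit=1.645):
    """Re-run a nominal 5% one-sided z-test on the mean after every pivot."""
    total = 0.0
    for t, u in enumerate(pivots, start=1):
        total += u
        z = (total - 0.5 * t) / math.sqrt(t / 12.0)  # Uniform(0,1): var 1/12
        if z > z_crit:
            return True
    return False


def ever_rejects_e_process(pivots, lam=0.5, alpha=0.05):
    """Stop the first time the betting wealth reaches 1/alpha."""
    wealth = 1.0
    for u in pivots:
        wealth *= 1.0 + lam * (2.0 * u - 1.0)
        if wealth >= 1.0 / alpha:
            return True
    return False


def false_alarm_rates(runs=1000, horizon=500, seed=0):
    """Fraction of null streams on which each monitor ever rejects."""
    rng = random.Random(seed)
    naive = e_proc = 0
    for _ in range(runs):
        pivots = [rng.random() for _ in range(horizon)]
        naive += ever_rejects_naive(pivots)
        e_proc += ever_rejects_e_process(pivots)
    return naive / runs, e_proc / runs
```

The naive monitor's false-alarm rate grows with the number of looks and ends up well above its nominal 5%, while Ville's inequality bounds the e-process rate by alpha at every horizon. The same harness could be fed dependent pivots to probe how the guarantee degrades.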

original abstract

Watermarking for large language models (LLMs) has emerged as an effective tool for distinguishing AI-generated text from human-written content. Statistically, watermark schemes induce dependence between generated tokens and a pseudo-random sequence, reducing watermark detection to a hypothesis testing problem on independence. We develop a unified framework for LLM watermark detection based on e-processes, providing anytime-valid guarantees for online testing. We propose various methods to construct empirically adaptive e-processes that can enhance the detection power. The proposed methods are applicable to any sequential testing problem where independent pivotal statistics are available. In addition, theoretical results are established to characterize the power properties of the proposed procedures. Some experiments demonstrate that the proposed framework achieves competitive performance compared to existing watermark detection methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a unified framework for LLM watermark detection based on e-processes, providing anytime-valid guarantees for online sequential testing. It reduces watermark detection to an independence testing problem under the assumption that watermark schemes induce dependence between tokens and a pseudo-random sequence, proposes methods for constructing empirically adaptive e-processes to improve detection power, establishes theoretical results on power properties, and demonstrates competitive experimental performance against existing methods. The framework is positioned as applicable to any sequential testing setting with independent pivotal statistics.

Significance. If the central reduction to independence testing holds and the adaptive e-process constructions deliver the claimed power gains while preserving anytime-validity, the work would provide a statistically rigorous online monitoring tool for AI-generated text, extending e-process theory to a timely application area. The emphasis on broad applicability and theoretical power characterizations strengthens its potential impact beyond LLM watermarking.

major comments (2)
  1. [§2 (Problem Setup and Reduction)] The central claim relies on the availability of independent pivotal statistics for e-process construction after the reduction to independence testing. The manuscript should explicitly verify this condition for standard watermark schemes (e.g., those inducing token-sequence dependence) with a concrete example or lemma showing that the resulting statistics remain independent and pivotal under the null.
  2. [§3 (Theoretical Results)] Theoretical power results are stated to characterize the procedures, but the growth rate of the adaptive e-processes under alternatives (relative to non-adaptive baselines) needs an explicit bound or comparison theorem to substantiate the claimed enhancement; without it, the adaptivity benefit remains qualitative.
minor comments (2)
  1. [§4 (Experiments)] In the experiments, report quantitative metrics such as empirical power at fixed type-I error levels and the number of tokens required for detection across multiple watermark schemes and text lengths to allow direct comparison with baselines.
  2. [§3 (Methods)] Clarify the precise definition of 'empirically adaptive' e-processes (e.g., how the adaptation is performed without using future data) in the methods section to avoid ambiguity for readers unfamiliar with e-process literature.
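The metrics requested in the first minor comment (empirical power at a fixed type-I error level and the number of tokens needed for detection) could be produced by a small simulation harness such as the one below. It is a sketch on synthetic data, not the paper's experimental protocol: null pivots are assumed Uniform(0, 1), and the "watermarked" stream is a toy alternative with density 2u on [0, 1], drawn as the square root of a uniform. The names `stopping_time` and `summarize` are illustrative.

```python
import random
import statistics


def stopping_time(pivots, lam=0.5, alpha=0.05):
    """Number of tokens until wealth reaches 1/alpha, or None if it never does."""
    wealth = 1.0
    for t, u in enumerate(pivots, start=1):
        wealth *= 1.0 + lam * (2.0 * u - 1.0)
        if wealth >= 1.0 / alpha:
            return t
    return None


def summarize(draw, runs=500, horizon=300, lam=0.5, alpha=0.05, seed=0):
    """Return (detection rate, median tokens to detection) over simulated streams."""
    rng = random.Random(seed)
    times = [
        stopping_time((draw(rng) for _ in range(horizon)), lam, alpha)
        for _ in range(runs)
    ]
    hits = [t for t in times if t is not None]
    median = statistics.median(hits) if hits else None
    return len(hits) / runs, median


def draw_null(rng):
    return rng.random()  # unwatermarked: Uniform(0, 1)


def draw_toy_watermark(rng):
    return rng.random() ** 0.5  # toy alternative with density 2u


null_rate, _ = summarize(draw_null)
power, median_tokens = summarize(draw_toy_watermark)
```

Reporting `null_rate` next to `power` and `median_tokens` for each scheme and text length would give the direct comparison with baselines that the comment asks for.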

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation and insightful comments. We address each major comment below and will incorporate the suggested changes in the revised manuscript.

point-by-point responses
  1. Referee: [§2 (Problem Setup and Reduction)] The central claim relies on the availability of independent pivotal statistics for e-process construction after the reduction to independence testing. The manuscript should explicitly verify this condition for standard watermark schemes (e.g., those inducing token-sequence dependence) with a concrete example or lemma showing that the resulting statistics remain independent and pivotal under the null.

    Authors: We appreciate this suggestion. Upon review, we note that the reduction to independence testing is standard for watermark schemes that introduce dependence between tokens and the pseudo-random sequence. In the revised version, we will add a lemma in §2 with a concrete example for the Kirchenbauer watermark scheme, proving that the resulting test statistics are independent and pivotal under the null hypothesis of no watermark. revision: yes

  2. Referee: [§3 (Theoretical Results)] Theoretical power results are stated to characterize the procedures, but the growth rate of the adaptive e-processes under alternatives (relative to non-adaptive baselines) needs an explicit bound or comparison theorem to substantiate the claimed enhancement; without it, the adaptivity benefit remains qualitative.

    Authors: We agree that making the power enhancement explicit would be beneficial. We will include a new theorem in §3 that provides an explicit lower bound on the growth rate of the log of the adaptive e-process under alternatives, demonstrating a strict improvement over non-adaptive counterparts in terms of asymptotic power. revision: yes
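For context on what a growth-rate statement of this kind typically looks like, the block below records two standard facts for product-form e-processes. It is not the theorem promised in the rebuttal. It assumes i.i.d. pivots with null law P and alternative law Q, with Q absolutely continuous with respect to P and the relevant log-expectations finite; the symbols Y_s, e, P, and Q are notation introduced here.

```latex
% Assumed setup: Y_1, Y_2, ... are i.i.d. from P under the null and
% i.i.d. from Q under the alternative; e is any e-value, E_P[e(Y)] <= 1.
\[
  E_t = \prod_{s=1}^{t} e(Y_s),
  \qquad
  \frac{1}{t}\log E_t \;\xrightarrow{\text{a.s.}}\;
  \mathbb{E}_Q\!\left[\log e(Y)\right]
  \quad (t \to \infty),
\]
\[
  \sup_{e \,:\, \mathbb{E}_P[e(Y)] \le 1} \mathbb{E}_Q\!\left[\log e(Y)\right]
  = \mathrm{KL}(Q \,\|\, P),
  \qquad \text{attained at } e^{\star} = \frac{dQ}{dP}.
\]
```

The first line is the strong law of large numbers applied to log e(Y_s); the second follows from Gibbs' inequality. Heuristically, the number of tokens needed to cross the threshold 1/alpha scales like log(1/alpha) divided by the growth rate, so a comparison theorem would need to show that the adaptive bets achieve a larger limiting rate than a fixed bet chosen without knowledge of Q.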

Circularity Check

0 steps flagged

No significant circularity; framework extends established e-process theory

full rationale

The derivation chain reduces watermark detection to a standard independence testing problem under the explicit applicability condition that watermark schemes induce token-sequence dependence and independent pivotal statistics exist. This reduction is stated as a precondition rather than derived internally. The unified e-process framework and adaptive constructions are proposed as extensions of prior e-process literature, with power properties characterized theoretically in a separate step. No equations, fitted parameters, or self-citations are shown to force the central guarantees by construction; the adaptive methods enhance power within the given setting without redefining inputs as outputs. This is the most common honest non-finding for papers that apply established sequential testing tools to a new domain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard properties of e-processes for sequential hypothesis testing and the modeling assumption that watermarking produces independent pivotal statistics; no free parameters or invented entities are introduced in the abstract description.

axioms (2)
  • [standard math] E-processes provide anytime-valid p-values for sequential testing under standard martingale conditions.
    Invoked when reducing watermark detection to online independence testing with anytime-valid guarantees.
  • [domain assumption] Watermark schemes induce dependence between tokens and a pseudo-random sequence, yielding pivotal statistics.
    Stated in the abstract as the basis for reducing detection to hypothesis testing on independence.
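The first axiom can be made concrete: given any e-process path, the reciprocal of its running maximum, capped at 1, is a p-value that stays valid at every time, again by Ville's inequality. The helper below is a minimal sketch; the name `anytime_p_values` is illustrative.

```python
def anytime_p_values(e_path):
    """Map an e-process path to anytime-valid p-values.

    p_t = min(1, 1 / max_{s <= t} E_s), which is non-increasing in t.
    """
    running_max, out = 0.0, []
    for e in e_path:
        running_max = max(running_max, e)
        out.append(1.0 if running_max <= 1.0 else 1.0 / running_max)
    return out
```

For example, the path [0.5, 4.0, 2.0, 20.0] maps to [1.0, 0.25, 0.25, 0.05]: the p-value never goes back up after the wealth drops from 4.0 to 2.0.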

pith-pipeline@v0.9.0 · 5416 in / 1313 out tokens · 31424 ms · 2026-05-15T21:46:06.292997+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems

    stat.ML 2026-05 unverdicted novelty 6.0

    A wrapper for black-box generate-verify AI pipelines that uses a conservative hard-negative reference pool and e-processes to control the probability of releasing on infeasible tasks while permitting release on feasible ones.