arXiv: 2511.08637 · v3 · submitted 2025-11-10 · cs.CY · cs.AI · cs.CR

How Do Data Owners Say No? A Case Study of Data Consent Mechanisms in Web-Scraped Vision-Language AI Training Datasets

Pith reviewed 2026-05-18 00:15 UTC · model grok-4.3

classification: cs.CY · cs.AI · cs.CR
keywords: data consent · web scraping · vision-language models · copyright notices · terms of service · watermarks · AI training datasets · CommonPool

The pith

Many samples in large web-scraped AI datasets carry owner signals against use in training that current pipelines ignore.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how data owners signal consent or its absence for their images to be used in training vision-language models. It studies the DataComp CommonPool collection of 12.8 billion text-image pairs by checking individual samples for copyright notices and watermarks and checking the top source websites for scraping bans in their Terms of Service. The analysis estimates at least 122 million samples with copyright indications, finds that 60 percent of samples from the top 50 domains come from sites whose ToS prohibit scraping, and estimates that 9 to 13 percent of samples contain watermarks that standard detectors miss. These results matter because they demonstrate that owner preferences expressed through ordinary web mechanisms are routinely bypassed in AI data collection. The work concludes that existing curation practices fall short and that a unified consent framework accounting for AI purposes is required.

Core claim

The authors establish that data owners convey non-consent for AI scraping and training through multiple channels, including copyright notices attached to samples, prohibitions stated in website Terms of Service, and embedded watermarks. Their examination of CommonPool identifies at least 122 million samples exhibiting copyright notices, determines that 60 percent of samples from the top 50 domains originate from websites whose ToS forbid scraping, and estimates with 95 percent confidence that 9-13 percent of samples contain watermarks that existing detection methods fail to capture in high fidelity. They conclude that current AI data collection pipelines do not entirely respect these signals and that a unified data consent framework that takes AI purposes into account is needed.

What carries the argument

Multi-channel consent signal detection that combines sample-level checks for copyright notices, watermarks, and metadata with domain-level review of Terms of Service and robots exclusion protocols.
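
As a rough illustration of the sample-level copyright check, the sketch below scans a caption and any OCR-extracted text for notice-like strings. The patterns, the `has_copyright_notice` helper, and the `flag_sample` function are assumptions made for this example; the paper's own regular expressions (its Figure 3) are not reproduced here.

```python
import re

# Hypothetical patterns chosen for illustration only; they are NOT the
# paper's actual search patterns.
COPYRIGHT_PATTERNS = [
    re.compile(r"©"),                                   # copyright sign
    re.compile(r"\(c\)\s*\d{4}", re.IGNORECASE),        # "(c) 2019"
    re.compile(r"\bcopyright\b", re.IGNORECASE),        # the word itself
    re.compile(r"\ball rights reserved\b", re.IGNORECASE),
]


def has_copyright_notice(text: str) -> bool:
    """Return True if any illustrative pattern matches the text."""
    return any(pattern.search(text) for pattern in COPYRIGHT_PATTERNS)


def flag_sample(caption: str, ocr_text: str = "") -> bool:
    """Flag a sample if its caption or its OCR text carries a notice."""
    return has_copyright_notice(caption) or has_copyright_notice(ocr_text)
```

A keyword match of this kind will also fire on phrases such as "copyright-free", which is one reason human validation of the flagged samples matters.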

If this is right

  • AI training pipelines should add filters or labels for copyright notices, ToS restrictions, and watermarks to avoid non-consensual data.
  • Dataset curation for releases like DataComp should incorporate systematic checks for these owner signals before distribution.
  • Developers training on such data face elevated copyright infringement risks when owner objections are present but unheeded.
  • A unified data consent framework that explicitly addresses AI training purposes would standardize how owners can express and pipelines can honor preferences.
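
One domain-level signal that a pipeline filter could honor is the Robots Exclusion Protocol. The sketch below uses Python's standard `urllib.robotparser` to test whether a given crawler may fetch a URL under a robots.txt file supplied as a string. The crawler name `ExampleAIBot`, the example rules, and the `allows_crawler` helper are hypothetical, and this is not a description of how the paper itself inspected robots.txt files.

```python
from urllib.robotparser import RobotFileParser


def allows_crawler(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: one named AI crawler is blocked everywhere,
# all other crawlers are blocked only from /private/.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

blocked = not allows_crawler(
    EXAMPLE_ROBOTS, "ExampleAIBot", "https://example.com/images/cat.jpg"
)
```

Parsing from a string keeps the check separate from fetching, so the same function can be run over an archived crawl of robots.txt files.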

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same patterns of overlooked consent signals are likely present in other large web-scraped collections used for language or multimodal models.
  • Practical tools that automatically parse ToS documents and improve watermark detection could be built to apply these checks at web scale.
  • Training models exclusively on data with verified owner consent might alter performance characteristics, though this remains untested in the study.

Load-bearing premise

The presence of a copyright notice, a Terms of Service prohibition on scraping, or a detected watermark accurately reflects a data owner's intent to withhold permission for AI training use.

What would settle it

A direct survey of owners of the websites and images flagged in the study, asking whether they specifically object to AI training use, would falsify the central interpretation if most owners report that they permit such use despite the observed signals.

Figures

Figures reproduced from arXiv: 2511.08637 by Aster Plotnik, Chung Peng Lee, Harry H. Jiang, Jamie Morgenstern, Rachel Hong, William Agnew.

Figure 1: The life cycle of curating, releasing, and using the web-scraped VLD. Even though the ...
Figure 2: Terms of Service annotations. The full population in each chart is all samples in the top 50 base domains of ...
Figure 3: Each category has multiple regular expression pat...
Figure 3: Regular expression search patterns used to source copyright notice in samples’ captions and OCR-extracted texts.
Figure 4: Distribution of the top 50 base domains in the ...
Original abstract

The internet has become the main source of data to train modern text-to-image or vision-language models, yet it is increasingly unclear whether web-scale data collection practices for training AI systems adequately respect data owners' wishes. Ignoring the owner's indication of consent around data usage not only raises ethical concerns but also has recently been elevated into lawsuits around copyright infringement cases. In this work, we aim to reveal information about data owners' consent to AI scraping and training, and study how it's expressed in DataComp, a popular dataset of 12.8 billion text-image pairs. We examine both the sample-level information, including the copyright notice, watermarking, and metadata, and the web-domain-level information, such as a site's Terms of Service (ToS) and Robots Exclusion Protocol. We estimate at least 122M of samples exhibit some indication of copyright notice in CommonPool, and find that 60% of the samples in the top 50 domains come from websites with ToS that prohibit scraping. Furthermore, we estimate 9-13% with 95% confidence interval of samples from CommonPool to contain watermarks, where existing watermark detection methods fail to capture them in high fidelity. Our holistic methods and findings show that data owners rely on various channels to convey data consent, of which current AI data collection pipelines do not entirely respect. These findings highlight the limitations of the current dataset curation/release practice and the need for a unified data consent framework taking AI purposes into consideration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper conducts an empirical analysis of consent-related signals in the CommonPool dataset (part of DataComp, with 12.8 billion text-image pairs) used for vision-language model training. It reports that at least 122 million samples contain indications of copyright notices, that 60% of samples from the top 50 domains come from sites whose Terms of Service prohibit scraping, and that 9-13% (95% CI) of samples contain watermarks that existing detectors miss at high fidelity. The authors conclude that current web-scraping pipelines for AI training do not fully respect data owners' expressed consent preferences across sample-level and domain-level channels.

Significance. If the quantitative estimates hold after validation, the work provides concrete, large-scale evidence of the prevalence of non-consent signals in widely used training corpora. This directly informs debates on ethical data curation, copyright litigation, and the design of future datasets, while highlighting the need for unified consent frameworks that account for AI training uses.
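
The 60% figure in the summary is a share of samples, not of domains, so a few very large domains can dominate it. The sketch below shows that distinction with made-up numbers; the domain names, counts, and the `prohibited_sample_share` helper are all hypothetical and do not come from the paper.

```python
from typing import Dict, Tuple


def prohibited_sample_share(domains: Dict[str, Tuple[int, bool]]) -> float:
    """Sample-weighted share of data from domains whose ToS ban scraping.

    `domains` maps a base domain to (sample_count, tos_prohibits_scraping).
    """
    total = sum(count for count, _ in domains.values())
    if total == 0:
        raise ValueError("no samples to weigh")
    prohibited = sum(count for count, banned in domains.values() if banned)
    return prohibited / total


# Hypothetical counts: only one of three domains bans scraping,
# yet it contributes most of the samples.
example_domains = {
    "a.example": (600, True),
    "b.example": (300, False),
    "c.example": (100, False),
}
share = prohibited_sample_share(example_domains)
```

Here one domain in three prohibits scraping, but it accounts for 600 of the 1,000 hypothetical samples.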

major comments (3)
  1. [Methods / Detection Pipeline] The central estimates (≥122M samples with copyright indications; 9-13% watermark prevalence) depend on the accuracy of the OCR-based copyright extractor and the watermark detector. The manuscript does not report human-validated precision/recall on a held-out, representative sample drawn from CommonPool itself; without this, false positives from logos, decorative text, or compression artifacts could inflate the headline figures.
  2. [Domain-level Analysis] The ToS analysis for the top-50 domains interprets prohibitions on scraping as signals of non-consent for AI training. The classification criteria for specific clauses (e.g., how 'commercial use' or 'automated access' language is mapped to AI training) are not detailed, nor is inter-annotator agreement or sensitivity to legal ambiguity reported; this affects the 60% claim.
  3. [Watermark Results] The claim that 'existing watermark detection methods fail to capture them in high fidelity' underpins the 9-13% estimate and its confidence interval. The paper should provide quantitative failure-case analysis or direct comparison against the detectors used, rather than a qualitative statement.
minor comments (2)
  1. [Abstract and Results] Clarify the exact sampling procedure and confidence-interval methodology used for the 9-13% watermark estimate, including how domain clustering or image duplication was handled.
  2. [Abstract] Ensure consistent use of 'CommonPool' versus 'DataComp' throughout; the abstract switches between the two without explicit definition of the subset analyzed.
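
To make the first minor comment concrete, here is one common way a 95% interval for a prevalence estimate could be computed from an annotated random sample: the Wilson score interval. This is a sketch under the assumption of independent, simple random draws; the review does not say which interval method the paper used, and the counts in the example are invented.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical annotation result: 110 watermarked images out of 1,000 sampled.
low, high = wilson_interval(110, 1000)
```

If samples cluster by domain or contain duplicates, the independence assumption fails and the interval would be too narrow, which is the concern the comment raises.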

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify areas where our presentation can be strengthened. We respond to each major comment below and commit to revisions that directly address the concerns while preserving the core empirical contributions of the work.

  1. Referee: [Methods / Detection Pipeline] The central estimates (≥122M samples with copyright indications; 9-13% watermark prevalence) depend on the accuracy of the OCR-based copyright extractor and the watermark detector. The manuscript does not report human-validated precision/recall on a held-out, representative sample drawn from CommonPool itself; without this, false positives from logos, decorative text, or compression artifacts could inflate the headline figures.

    Authors: We agree that explicit human validation metrics on CommonPool samples would increase confidence in the headline estimates. The manuscript describes the OCR pipeline and watermark detection approach in Section 3 but does not include a dedicated human evaluation. In the revised version we will add a validation subsection that reports precision and recall from two independent annotators on a random sample of 1,000 CommonPool images, together with an error analysis of false-positive cases such as logos and compression artifacts. This addition will directly mitigate the risk of inflated figures. revision: yes

  2. Referee: [Domain-level Analysis] The ToS analysis for the top-50 domains interprets prohibitions on scraping as signals of non-consent for AI training. The classification criteria for specific clauses (e.g., how 'commercial use' or 'automated access' language is mapped to AI training) are not detailed, nor is inter-annotator agreement or sensitivity to legal ambiguity reported; this affects the 60% claim.

    Authors: We acknowledge that the current manuscript provides insufficient detail on the ToS classification rules and does not report inter-annotator agreement or sensitivity checks. We will revise the methods section to include a full classification codebook with explicit mappings of clauses (e.g., scraping bans and commercial-use restrictions) to non-consent for AI training, report inter-annotator agreement between the two coders who performed the labeling, and add a sensitivity analysis that tests alternative reasonable interpretations of ambiguous language. These changes will make the 60% figure more transparent and reproducible. revision: yes

  3. Referee: [Watermark Results] The claim that 'existing watermark detection methods fail to capture them in high fidelity' underpins the 9-13% estimate and its confidence interval. The paper should provide quantitative failure-case analysis or direct comparison against the detectors used, rather than a qualitative statement.

    Authors: We accept that a purely qualitative statement is insufficient to support the claim. The manuscript currently offers a qualitative assessment based on applying standard detectors to watermarked samples identified by our pipeline. In the revision we will add a quantitative comparison: we will evaluate several widely used watermark detectors on a held-out subset of images containing watermarks and report detection success rates in a new table. This will convert the statement into a data-driven failure-case analysis while retaining the 9-13% prevalence estimate. revision: yes
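
The validation metrics promised in the responses above (precision and recall against human labels, and agreement between two annotators) can be sketched as follows. The function names and toy labels are assumptions for illustration; nothing here reflects results from the paper, and Cohen's kappa is only one possible agreement statistic.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def precision_recall(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Precision and recall of detector output against human labels."""
    if len(predicted) != len(actual):
        raise ValueError("label sequences must have equal length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy labels, invented for this example.
detector = [True, True, False, True, False]
human = [True, False, False, True, True]
coder_1 = ["prohibits", "prohibits", "allows", "allows"]
coder_2 = ["prohibits", "allows", "allows", "allows"]

precision, recall = precision_recall(detector, human)
kappa = cohens_kappa(coder_1, coder_2)
```

Reporting kappa alongside raw agreement helps when one label dominates, since raw agreement can look high by chance alone.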

Circularity Check

0 steps flagged

No circularity: direct empirical measurements of web data signals

This is an empirical case study that directly analyzes samples from CommonPool, detects copyright notices via OCR, estimates watermark prevalence, and reviews ToS and robots.txt on top domains. The abstract and described methods contain no equations, no fitted parameters presented as predictions, no derivations, and no self-citations invoked as load-bearing uniqueness theorems or ansatzes. All quantitative claims (≥122M samples, 60% ToS prohibition, 9-13% watermark interval) are presented as outputs of applied detection pipelines on external data rather than reductions to prior results by construction. The derivation chain is therefore self-contained and consists solely of measurement steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on interpretive assumptions about what legal and technical signals mean for AI-specific consent rather than direct owner statements or validated detection benchmarks.

axioms (2)
  • domain assumption: Website Terms of Service that prohibit scraping indicate lack of consent for AI training data collection.
    The paper treats ToS prohibitions as relevant to AI scraping without additional owner confirmation.
  • domain assumption: Watermark presence can be estimated even when standard detection methods fail to capture them in high fidelity.
    The 9-13% estimate is offered alongside the note that existing methods miss many cases.

pith-pipeline@v0.9.0 · 5591 in / 1488 out tokens · 42105 ms · 2026-05-18T00:15:43.528707+00:00 · methodology
