How Do Data Owners Say No? A Case Study of Data Consent Mechanisms in Web-Scraped Vision-Language AI Training Datasets
Pith reviewed 2026-05-18 00:15 UTC · model grok-4.3
The pith
Many samples in large web-scraped AI datasets carry owner signals against use in training that current pipelines ignore.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that data owners convey non-consent for AI scraping and training through multiple channels, including copyright notices attached to samples, prohibitions stated in website Terms of Service, and embedded watermarks. Their examination of CommonPool identifies at least 122 million samples exhibiting copyright notices, determines that 60 percent of samples from the top 50 domains originate from websites whose ToS forbid scraping, and estimates with 95 percent confidence that 9-13 percent of samples contain watermarks that existing detection methods fail to capture with high fidelity. They conclude that current AI data collection pipelines do not entirely respect these signals and that a
What carries the argument
Multi-channel consent signal detection that combines sample-level checks for copyright notices, watermarks, and metadata with domain-level review of Terms of Service and robots exclusion protocols.
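One of the domain-level channels named above, the Robots Exclusion Protocol, can be checked with Python's standard library. The following is a minimal sketch, not the authors' pipeline: the robots.txt content, the crawler names `ExampleAIBot` and `GenericCrawler`, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the agent names are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


image_url = "https://example.com/images/cat.jpg"

# The AI-specific crawler is blocked site-wide by its own rule group.
ai_allowed = allows(ROBOTS_TXT, "ExampleAIBot", image_url)

# Any other crawler falls back to the wildcard group.
generic_allowed = allows(ROBOTS_TXT, "GenericCrawler", image_url)
private_allowed = allows(
    ROBOTS_TXT, "GenericCrawler", "https://example.com/private/a.jpg"
)
```

A robots.txt rule states a crawling preference for a named agent; whether it also expresses a preference about training use is exactly the interpretive question the review raises under its load-bearing premise.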
If this is right
- AI training pipelines should add filters or labels for copyright notices, ToS restrictions, and watermarks to avoid non-consensual data.
- Dataset curation for releases like DataComp should incorporate systematic checks for these owner signals before distribution.
- Developers training on such data face elevated copyright infringement risks when owner objections are present but unheeded.
- A unified data consent framework that explicitly addresses AI training purposes would standardize how owners can express and pipelines can honor preferences.
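The first implication above, labeling samples that carry a copyright notice, could look roughly like the sketch below. It assumes each sample is a dict with hypothetical `caption` and `ocr_text` fields and uses a deliberately simple regular expression; it is not the extractor used in the paper.

```python
import re

# Illustrative pattern (an assumption, not the paper's extractor): matches
# the copyright sign, "(c)", the word "copyright", or "all rights reserved".
COPYRIGHT_RE = re.compile(
    r"©|\(c\)|\bcopyright\b|\ball rights reserved\b",
    re.IGNORECASE,
)


def has_copyright_notice(text: str) -> bool:
    """Return True if the text contains a copyright-style marker."""
    return COPYRIGHT_RE.search(text) is not None


def label_samples(samples):
    """Attach a 'copyright_flag' to each sample without dropping any.

    Each sample is assumed to be a dict with optional 'caption' and
    'ocr_text' string fields (hypothetical field names).
    """
    labeled = []
    for sample in samples:
        text = f"{sample.get('caption', '')} {sample.get('ocr_text', '')}"
        labeled.append({**sample, "copyright_flag": has_copyright_notice(text)})
    return labeled


samples = [
    {"caption": "Sunset over the bay", "ocr_text": ""},
    {"caption": "Wedding portrait", "ocr_text": "© 2019 Jane Doe Photography"},
    {"caption": "Stock photo. All Rights Reserved.", "ocr_text": ""},
]
flags = [s["copyright_flag"] for s in label_samples(samples)]
```

Labeling rather than deleting keeps the decision about flagged samples with the dataset curator. A pattern this loose will also match enumerations such as "(a), (b), (c)", which is the kind of false positive the referee report asks the authors to quantify.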
Where Pith is reading between the lines
- The same patterns of overlooked consent signals are likely present in other large web-scraped collections used for language or multimodal models.
- Practical tools that automatically parse ToS documents and improve watermark detection could be built to apply these checks at web scale.
- Training models exclusively on data with verified owner consent might alter performance characteristics, though this remains untested in the study.
Load-bearing premise
The presence of a copyright notice, a Terms of Service prohibition on scraping, or a detected watermark accurately reflects a data owner's intent to withhold permission for AI training use.
What would settle it
A direct survey of owners of the websites and images flagged in the study, asking whether they specifically object to AI training use, would falsify the central interpretation if most owners report that they permit such use despite the observed signals.
Original abstract
The internet has become the main source of data to train modern text-to-image or vision-language models, yet it is increasingly unclear whether web-scale data collection practices for training AI systems adequately respect data owners' wishes. Ignoring the owner's indication of consent around data usage not only raises ethical concerns but also has recently been elevated into lawsuits around copyright infringement cases. In this work, we aim to reveal information about data owners' consent to AI scraping and training, and study how it's expressed in DataComp, a popular dataset of 12.8 billion text-image pairs. We examine both the sample-level information, including the copyright notice, watermarking, and metadata, and the web-domain-level information, such as a site's Terms of Service (ToS) and Robots Exclusion Protocol. We estimate at least 122M of samples exhibit some indication of copyright notice in CommonPool, and find that 60% of the samples in the top 50 domains come from websites with ToS that prohibit scraping. Furthermore, we estimate 9-13% with 95% confidence interval of samples from CommonPool to contain watermarks, where existing watermark detection methods fail to capture them in high fidelity. Our holistic methods and findings show that data owners rely on various channels to convey data consent, of which current AI data collection pipelines do not entirely respect. These findings highlight the limitations of the current dataset curation/release practice and the need for a unified data consent framework taking AI purposes into consideration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical analysis of consent-related signals in the CommonPool dataset (part of DataComp, with 12.8 billion text-image pairs) used for vision-language model training. It reports that at least 122 million samples contain indications of copyright notices, that 60% of samples from the top 50 domains come from sites whose Terms of Service prohibit scraping, and that 9-13% (95% CI) of samples contain watermarks that existing detectors fail to capture with high fidelity. The authors conclude that current web-scraping pipelines for AI training do not fully respect data owners' expressed consent preferences across sample-level and domain-level channels.
Significance. If the quantitative estimates hold after validation, the work provides concrete, large-scale evidence of the prevalence of non-consent signals in widely used training corpora. This directly informs debates on ethical data curation, copyright litigation, and the design of future datasets, while highlighting the need for unified consent frameworks that account for AI training uses.
major comments (3)
- [Methods / Detection Pipeline] The central estimates (≥122M samples with copyright indications; 9-13% watermark prevalence) depend on the accuracy of the OCR-based copyright extractor and the watermark detector. The manuscript does not report human-validated precision/recall on a held-out, representative sample drawn from CommonPool itself; without this, false positives from logos, decorative text, or compression artifacts could inflate the headline figures.
- [Domain-level Analysis] The ToS analysis for the top-50 domains interprets prohibitions on scraping as signals of non-consent for AI training. The classification criteria for specific clauses (e.g., how 'commercial use' or 'automated access' language is mapped to AI training) are not detailed, nor is inter-annotator agreement or sensitivity to legal ambiguity reported; this affects the 60% claim.
- [Watermark Results] The claim that 'existing watermark detection methods fail to capture them in high fidelity' underpins the 9-13% estimate and its confidence interval. The paper should provide quantitative failure-case analysis or direct comparison against the detectors used, rather than a qualitative statement.
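The human validation requested in the first major comment comes down to comparing detector output with annotator labels on the same images. The sketch below shows that computation under stated assumptions: the two boolean lists are made-up toy data, and the function name is hypothetical.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for paired boolean labels.

    predicted: detector output per image (True = notice/watermark flagged)
    actual:    human annotation per image (True = notice/watermark present)
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if not p and a)

    # Guard against empty denominators (no flagged or no positive items).
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy data: 8 images, detector flags vs. human labels (illustrative only).
detector = [True, True, True, False, False, True, False, False]
human = [True, False, True, True, False, True, False, False]

precision, recall = precision_recall(detector, human)
```

Precision below 1 means some flagged samples are false positives such as logos or decorative text, so a raw count of flagged samples overstates the number of genuine notices; recall below 1 means the count also misses some.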
minor comments (2)
- [Abstract and Results] Clarify the exact sampling procedure and confidence-interval methodology used for the 9-13% watermark estimate, including how domain clustering or image duplication was handled.
- [Abstract] Ensure consistent use of 'CommonPool' versus 'DataComp' throughout; the abstract switches between the two without explicit definition of the subset analyzed.
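To make the first minor comment concrete, here is one common way to attach a 95% interval to a prevalence estimate from a manually labeled sample: the Wilson score interval. This is a sketch under assumptions, since the review does not state which interval method or sample size the paper used; the counts below (110 watermarked images out of 1,000) are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence. Assumes the n samples
    are independent draws, i.e. simple random sampling.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical labeled sample: 110 of 1,000 images judged to be watermarked.
low, high = wilson_interval(110, 1000)
```

The independence assumption is where the referee's question bites: if the labeled images cluster by domain or include duplicates, the effective sample size is smaller than n and an interval computed this way is too narrow.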
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify areas where our presentation can be strengthened. We respond to each major comment below and commit to revisions that directly address the concerns while preserving the core empirical contributions of the work.
- Referee: [Methods / Detection Pipeline] The central estimates (≥122M samples with copyright indications; 9-13% watermark prevalence) depend on the accuracy of the OCR-based copyright extractor and the watermark detector. The manuscript does not report human-validated precision/recall on a held-out, representative sample drawn from CommonPool itself; without this, false positives from logos, decorative text, or compression artifacts could inflate the headline figures.
  Authors: We agree that explicit human validation metrics on CommonPool samples would increase confidence in the headline estimates. The manuscript describes the OCR pipeline and watermark detection approach in Section 3 but does not include a dedicated human evaluation. In the revised version we will add a validation subsection that reports precision and recall from two independent annotators on a random sample of 1,000 CommonPool images, together with an error analysis of false-positive cases such as logos and compression artifacts. This addition will directly mitigate the risk of inflated figures.
  Revision: yes
- Referee: [Domain-level Analysis] The ToS analysis for the top-50 domains interprets prohibitions on scraping as signals of non-consent for AI training. The classification criteria for specific clauses (e.g., how 'commercial use' or 'automated access' language is mapped to AI training) are not detailed, nor is inter-annotator agreement or sensitivity to legal ambiguity reported; this affects the 60% claim.
  Authors: We acknowledge that the current manuscript provides insufficient detail on the ToS classification rules and does not report inter-annotator agreement or sensitivity checks. We will revise the methods section to include a full classification codebook with explicit mappings of clauses (e.g., scraping bans and commercial-use restrictions) to non-consent for AI training, report inter-annotator agreement between the two coders who performed the labeling, and add a sensitivity analysis that tests alternative reasonable interpretations of ambiguous language. These changes will make the 60% figure more transparent and reproducible.
  Revision: yes
- Referee: [Watermark Results] The claim that 'existing watermark detection methods fail to capture them in high fidelity' underpins the 9-13% estimate and its confidence interval. The paper should provide quantitative failure-case analysis or direct comparison against the detectors used, rather than a qualitative statement.
  Authors: We accept that a purely qualitative statement is insufficient to support the claim. The manuscript currently offers a qualitative assessment based on applying standard detectors to watermarked samples identified by our pipeline. In the revision we will add a quantitative comparison: we will evaluate several widely used watermark detectors on a held-out subset of images containing watermarks and report detection success rates in a new table. This will convert the statement into a data-driven failure-case analysis while retaining the 9-13% prevalence estimate.
  Revision: yes
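The inter-annotator agreement promised in the second response is often reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. The rebuttal does not say which statistic the authors will use, so the following is only a sketch with made-up labels from two hypothetical coders classifying ten ToS documents.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for ten ToS documents (illustrative only).
coder_a = ["prohibits"] * 5 + ["allows"] * 5
coder_b = ["prohibits"] * 4 + ["allows"] * 5 + ["prohibits"]

kappa = cohens_kappa(coder_a, coder_b)
```

In this toy case the coders agree on 8 of 10 documents, but with balanced labels half of that agreement is expected by chance, so kappa comes out near 0.6 rather than 0.8.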
Circularity Check
No circularity: direct empirical measurements of web data signals
This is an empirical case study that directly analyzes samples from CommonPool, detects copyright notices via OCR, estimates watermark prevalence, and reviews ToS and robots.txt on top domains. The abstract and described methods contain no equations, no fitted parameters presented as predictions, no derivations, and no self-citations invoked as load-bearing uniqueness theorems or ansatzes. All quantitative claims (≥122M samples, 60% ToS prohibition, 9-13% watermark interval) are presented as outputs of applied detection pipelines on external data rather than reductions to prior results by construction. The derivation chain is therefore self-contained and consists solely of measurement steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Website Terms of Service that prohibit scraping indicate lack of consent for AI training data collection
- domain assumption: Watermark presence can be estimated even when standard detection methods fail to capture them in high fidelity
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We estimate at least 122M of samples exhibit some indication of copyright notice in CommonPool, and find that 60% of the samples in the top 50 domains come from websites with ToS that prohibit scraping.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Furthermore, we estimate 9-13% with 95% confidence interval of samples from CommonPool to contain watermarks, where existing watermark detection methods fail to capture them in high fidelity.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.