arXiv: 2506.17185 · v2 · submitted 2025-06-20 · 💻 cs.CR · cs.CY

A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset

Pith reviewed 2026-05-19 08:15 UTC · model grok-4.3

classification 💻 cs.CR cs.CY
keywords web-scraped datasets · personally identifiable information · privacy laws · machine learning training data · data sanitization · legal risks · AI data curation · publicly available data

The pith

Web-scraped datasets for AI training contain significant personally identifiable information despite sanitization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts an empirical audit of a popular large-scale web-scraped dataset used to train machine learning models. It identifies substantial amounts of legally defined personal data that remain even after cleaning steps intended to remove such content. This finding grounds a legal analysis showing how current curation practices can expose developers and users to privacy violations under existing laws. The authors use the concrete case to question broad assumptions about publicly available internet data. They call for changes in how such data is treated to reduce risks in downstream AI systems.

Core claim

An audit of one popular web-scraped machine learning dataset finds significant personally identifiable information persisting after sanitization. This supplies concrete evidence that large-scale web-scraped corpora may contain legally defined personal data, which can then propagate into models trained on them. The empirical results inform an analysis of risks under privacy and data protection laws and support the argument that frameworks treating internet content as freely available for AI training should be reoriented to impose meaningful limits on indiscriminate scraping.

What carries the argument

The combination of a targeted privacy audit of a real-world sanitized dataset with legal analysis of how personal data in training sets interacts with existing privacy statutes.
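
To make the audit half of that combination concrete, here is a minimal sketch of one kind of check such an audit might include: scanning caption text with regular expressions and reporting per-category prevalence. The caption of Figure 25 mentions regular-expression matches, but the rules used are not given on this page, so the patterns, category names, and function names below are illustrative assumptions rather than the paper's detection pipeline. Pattern hits are only candidates; they would still need manual verification.

```python
import re
from collections import Counter

# Hypothetical, deliberately narrow patterns for illustration only.
# A real audit would need broader, locale-aware rules and human review.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_phone": re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
    ),
    "us_ssn_like": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
}


def scan_captions(captions):
    """Count captions matching each pattern (at most once per category)."""
    hits = Counter()
    for text in captions:
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                hits[name] += 1
    return hits


def prevalence(hits, total):
    """Convert per-category hit counts into fractions of all captions."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {name: hits.get(name, 0) / total for name in PII_PATTERNS}


if __name__ == "__main__":
    # Made-up captions, not drawn from any dataset.
    sample = [
        "Contact jane.doe@example.com for prints",
        "Call 206-555-0100 today",
        "Sunset over the lake",
    ]
    print(prevalence(scan_captions(sample), len(sample)))
```

Counting each caption at most once per category keeps the prevalence figure interpretable as "share of samples with at least one candidate match", which is the quantity a reader comparing datasets would most likely want.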

If this is right

  • Models trained on such data can embed and later disclose personal information from the original sources.
  • Organizations that compile or use these datasets face concrete exposure under data-protection regulations.
  • Current cleaning pipelines do not reliably prevent personal data from reaching downstream AI applications.
  • Legal exposure can extend beyond the original curators to any party that trains or deploys models on the data.
  • Redefining the boundary of publicly available information would constrain how future datasets are assembled.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar privacy leakage patterns may appear in other large web corpora even when their curators claim different cleaning approaches.
  • Shifting to narrower, purpose-collected data sources could become a practical way to reduce legal surface area for AI developers.
  • Courts or regulators might treat the documented presence of personal data as evidence in future enforcement actions against scraped-data models.
  • Systematic cross-dataset comparisons would test whether the observed issues are isolated or structural.

Load-bearing premise

The single audited dataset and its particular sanitization steps stand in for the full class of large-scale web-scraped machine learning corpora.

What would settle it

A follow-up audit of several other major web-scraped datasets, each cleaned with comparable methods, that finds no measurable personal data.

Figures

Figures reproduced from arXiv: 2506.17185 by Imaad Huda, Jamie Morgenstern, Jevan Hutson, Rachel Hong, Tadayoshi Kohno, William Agnew.

Figure 1: Data lifecycle of how personal information appears
Figure 2: A high-level depiction of how personal information
Figure 3: Number of annotated samples that link a name
Figure 4: Examples of identifying sociodemographic information found in CommonPool’s small scale dataset. For each sample,
Figure 5: Examples of identity-related documents found in CommonPool’s small scale dataset, showing a credit card, social
Figure 6: Sample counts of annotated personal information present in the 168 resume documents with validated online presence,
Figure 7: Examples of resume documents and personal disclosures found in CommonPool’s small scale dataset. For each sample,
Figure 8: Real examples of children’s information found in CommonPool’s small scale dataset. For each sample, the type of
Figure 9: Error breakdown of most common websites of which all samples failed to download for the
Figure 10: Sample counts of non-empty Exif tags relating to
Figure 11: The stakeholder network demonstrates the potential flow of personal information between actors in the Internet
Figure 12: Example CommonPool images that contain text
Figure 13: Word visualizations of captions (without stop words) of a 1 million random subsample of CommonPool.
Figure 14: Word visualizations of OCR-extracted text (without stop words) of a 1 million random subsample of CommonPool.
Figure 15: Bigram disk visualizations in the caption and OCR-extracted text of 1 million random subsamples of CommonPool.
Figure 16: Top 50 most common celebrities from Pantheon 2020 dataset [
Figure 17: Additional bar graphs from searching Pantheon 2020 celebrity names [
Figure 18: Top 50 most common Presidio-detected names from CommonPool captions and OCR-extracted text.
Figure 19: Sample count breakdown of country of address disclosed by validated resumes.
Figure 20: Sample count breakdown of national origin or citizenship disclosed by validated resumes.
Figure 21: Earliest timestamp of URLs of validated resumes according to Internet Archive’s Wayback Machine [
Figure 22: Website frequency of children-related information.
Figure 23: Sample counts of top 15 most common HTTP errors for images that failed to download during a download version
Figure 24: Sample counts by year of earliest timestamps according to the Wayback Machine records for a random subsample of
Figure 25: Download error rate for samples grouped by regular expression matches to instances of personal information. The
Figure 26: Analysis of website URLs of manually confirmed images of faces not caught by SCRFD.

Original abstract

We investigate the contents of web-scraped data for training AI systems, at sizes where human dataset curators and compilers no longer manually annotate every sample. Building off of prior privacy concerns in machine learning models, we ask: What are the legal privacy implications of web-scraped machine learning datasets? In an empirical study of a popular training dataset, we find significant presence of personally identifiable information despite sanitization efforts. Our audit provides concrete evidence to support the concern that any large-scale web-scraped dataset may contain legally defined personal data. We use these findings of a real-world dataset to inform our legal analysis with respect to existing privacy and data protection laws. We surface various legal risks of current data curation practices that may propagate personal information to train downstream models. Based on our empirical and legal analyses, we argue for reorientation of current frameworks of "publicly available" information to meaningfully limit the development of AI built upon indiscriminate scraping of the internet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical audit of a popular large-scale web-scraped ML training dataset and reports the presence of personally identifiable information (PII) despite prior sanitization. It leverages these findings to analyze risks under privacy and data protection laws, arguing that indiscriminate web scraping propagates personal data into downstream models and calling for a reorientation of legal frameworks treating such data as 'publicly available.'

Significance. If the audit methodology and generalization hold, the work offers a valuable bridge between technical dataset analysis and legal privacy scholarship in AI, providing concrete examples that could inform policy on data curation practices. The explicit combination of an empirical study with downstream legal implications is a constructive contribution, though its impact depends on addressing the scope of the single-dataset evidence.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Empirical Study): The claim of 'significant presence of personally identifiable information despite sanitization efforts' is asserted without reported quantitative details on audit methodology, sample size, false-positive rates, or evaluation criteria for the sanitization process. This directly weakens the evidential basis for the central generalization.
  2. [§5 and Conclusion] §5 (Legal Analysis) and Conclusion: The inference that findings from one audited dataset support the claim that 'any large-scale web-scraped dataset may contain legally defined personal data' requires demonstration that the dataset's scraping sources, scale, and cleaning pipeline are representative or not atypically lax relative to other corpora (e.g., those with additional deduplication steps). Without this, the legal-risk conclusions for the broader class do not follow.
minor comments (2)
  1. [Results tables] Table 1 or equivalent results summary: Clarify the exact PII categories detected and their prevalence to improve readability for non-technical legal readers.
  2. [Related Work] Related work section: Add explicit comparison to prior audits of Common Crawl-derived datasets to better situate the novelty of the sanitization evaluation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve the clarity and evidential support of the manuscript.

  1. Referee: [Abstract and §4] Abstract and §4 (Empirical Study): The claim of 'significant presence of personally identifiable information despite sanitization efforts' is asserted without reported quantitative details on audit methodology, sample size, false-positive rates, or evaluation criteria for the sanitization process. This directly weakens the evidential basis for the central generalization.

    Authors: We agree that the empirical audit section requires greater transparency to support our claims. In the revised manuscript we will expand §4 with quantitative details, including the total number of samples audited, the precise PII detection methodology and tools employed, results from any manual verification used to estimate false-positive rates, and explicit criteria for assessing the prior sanitization process. These additions will strengthen the evidential foundation without altering the core findings. revision: yes

  2. Referee: [§5 and Conclusion] §5 (Legal Analysis) and Conclusion: The inference that findings from one audited dataset support the claim that 'any large-scale web-scraped dataset may contain legally defined personal data' requires demonstration that the dataset's scraping sources, scale, and cleaning pipeline are representative or not atypically lax relative to other corpora (e.g., those with additional deduplication steps). Without this, the legal-risk conclusions for the broader class do not follow.

    Authors: We acknowledge the referee's concern about generalizability. The audited dataset is a widely adopted example that reflects common web-scraping and sanitization practices found across many large-scale ML corpora. In revision we will add a dedicated paragraph in §5 comparing its sources, scale, and cleaning steps to other prominent datasets and citing related reports of PII leakage in the literature. We maintain, however, that the legal analysis centers on risks inherent to the widespread practice of indiscriminate web scraping rather than on proving every possible corpus is identical; the single concrete case study is offered as illustrative evidence of those risks. We will revise the conclusion to clarify this scope while preserving the policy recommendations. revision: partial
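
The manual-verification step mentioned in the first response can be illustrated with a short calculation. The sketch below assumes a detector has flagged some samples, reviewers have checked a random subset of those flags by hand, and the goal is an interval estimate of the detector's precision (one minus the false-positive share among flagged samples). The function names and the choice of a Wilson score interval are assumptions made for illustration; nothing here reproduces the paper's procedure or numbers.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def estimated_true_positives(flagged_total, confirmed, reviewed):
    """Scale the precision interval up to all flagged samples.

    flagged_total: samples the detector flagged (hypothetical input)
    confirmed:     reviewed flags that humans confirmed as personal data
    reviewed:      size of the manually reviewed random subset
    """
    low, high = wilson_interval(confirmed, reviewed)
    return flagged_total * low, flagged_total * high


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the functions.
    low, high = wilson_interval(confirmed := 90, reviewed := 100)
    print(f"precision between {low:.3f} and {high:.3f}")
    print(estimated_true_positives(10_000, confirmed, reviewed))
```

The Wilson interval is used here because it behaves sensibly when the observed proportion is close to 0 or 1, which is common when a detector is either very precise or very noisy; a simpler normal-approximation interval can fall outside the range 0 to 1 in those cases.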

Circularity Check

0 steps flagged

No significant circularity; empirical audit provides independent evidence

The paper conducts a direct empirical audit of one large-scale web-scraped dataset to identify personally identifiable information that persists after sanitization. This observation is external to the paper's own claims and is not derived from any self-referential definitions, fitted parameters, or prior results by the same authors. The legal analysis then applies existing privacy frameworks to the observed risks without reducing the central claim to an input by construction. No equations, uniqueness theorems, ansatzes, or self-citation chains are load-bearing in the derivation. The generalization from the audited example to 'any large-scale web-scraped dataset' is an inductive inference whose strength can be debated on representativeness grounds, but that does not create circularity under the specified criteria. The work is self-contained against external benchmarks via the dataset inspection itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the presence of PII in one audited corpus supports a general statement about all large-scale web-scraped datasets and that current legal definitions of personal data apply directly to the observed instances.

axioms (1)
  • domain assumption: The examined dataset and its sanitization steps are representative of typical industry web-scraped corpora.
    Invoked to generalize from the single audit to the broader claim about 'any large-scale web-scraped dataset'.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms

    cs.LG · 2026-04 · unverdicted · novelty 7.0

    Unlearnable examples fail under pretraining-finetuning due to semantic filtering by frozen layers, but Shallow Semantic Camouflage restores effectiveness by confining perturbations to semantically valid subspaces.

  2. How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption

    cs.CY · 2025-10 · accept · novelty 7.0

    Public defenders view AI as most useful for evidence investigation but limited in courtroom work and strategy, with adoption blocked by costs, confidentiality risks, and norms, requiring human oversight and open development.

  3. Security Considerations for Multi-agent Systems

    cs.CR · 2026-03 · unverdicted · novelty 6.0

    No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.

  4. Cyclic Adaptive Private Synthesis for Sharing Real-World Data in Education

    cs.CY · 2026-02 · unverdicted · novelty 6.0

    CAPS provides an iterative differentially private synthesis method that outperforms one-shot baselines on authentic educational real-world data.
