pith. machine review for the scientific record.

arxiv: 2602.10995 · v2 · submitted 2026-02-11 · 💻 cs.CY

Recognition: 2 theorem links


A Human-Centric Framework for Data Attribution in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:30 UTC · model grok-4.3

classification 💻 cs.CY
keywords data attribution · large language models · data economy · stakeholder negotiation · LLM governance · creator incentives · human-centric AI

The pith

A framework lets creators, users and intermediaries negotiate data attribution parameters for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM systems leave creators without control over their data and expose users to unwitting plagiarism. The paper proposes a human-centric framework that embeds attribution decisions inside the larger data economy. Use cases such as creative writing assistance or fact-checking are defined by adjustable parameters that capture stakeholder objectives and implementation criteria. These parameters are negotiated among creators, LLM users, and intermediaries, after which the chosen criteria are implemented and tested against the original goals. The approach is intended to connect existing NLP attribution methods with policy governance and economic analysis of creator incentives.

Core claim

The proposed human-centric data attribution framework situates the attribution problem within the broader data economy. Specific use cases for attribution, such as creative writing assistance or fact-checking, can be specified via a set of parameters including stakeholder objectives and implementation criteria. These criteria are up for negotiation by the relevant stakeholder groups: creators, LLM users, and their intermediaries. The outcome of domain-specific negotiations can be implemented and tested for whether the stakeholder goals are achieved.
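The paper describes the parameter set and the negotiate-implement-test cycle only at a conceptual level. As a reading aid, here is a minimal sketch of how one negotiated use case might be encoded; all field and class names are hypothetical, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class AttributionUseCase:
    """Hypothetical encoding of one negotiated attribution use case."""
    name: str                           # e.g. "creative writing assistance"
    stakeholders: list[str]             # creators, LLM users, intermediaries
    objectives: dict[str, str]          # stakeholder -> stated goal
    implementation_criteria: list[str]  # testable criteria agreed in negotiation

    def goals_met(self, test_results: dict[str, bool]) -> bool:
        """Final step of the cycle: every negotiated criterion must pass its test."""
        return all(test_results.get(c, False) for c in self.implementation_criteria)

# Illustrative instance for the fact-checking use case named in the abstract.
case = AttributionUseCase(
    name="fact-checking",
    stakeholders=["creators", "LLM users", "publishers"],
    objectives={"creators": "source credit", "LLM users": "verifiable citations"},
    implementation_criteria=["output cites a source", "cited source supports claim"],
)
print(case.goals_met({"output cites a source": True,
                      "cited source supports claim": True}))  # → True
```

The sketch makes the framework's order of operations concrete: objectives and criteria are fixed by negotiation first, and only then does implementation get scored against them.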

What carries the argument

The negotiable parameter set of stakeholder objectives and implementation criteria that defines and tests domain-specific attribution use cases.

If this is right

  • Attribution rules can be customized for particular applications such as creative writing or fact-checking.
  • Negotiations can align incentives across creators, users, and platforms in the data economy.
  • Methodological NLP techniques can be applied within governance structures defined by the negotiated criteria.
  • Testing outcomes can indicate whether a sustainable equilibrium for data creators is reached.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework implies the need for new mechanisms or institutions to host and enforce the negotiations.
  • Pilot implementations could be run on existing open models to measure outcomes such as creator compensation and user citation rates.
  • Success would supply a concrete template that regulators could adopt or adapt for broader AI data rules.

Load-bearing premise

That creators, LLM users, and intermediaries can reach agreements on parameters and that the resulting criteria can be implemented and tested to meet their stated objectives.

What would settle it

Consistent failure of stakeholder negotiations to produce criteria that can be implemented in real LLM systems, or testing that shows the implemented criteria do not advance the stated goals of any stakeholder group.

Figures

Figures reproduced from arXiv: 2602.10995 by Amelie Wührl, Anna Rogers, Kyle Lo, Mattes Ruckdeschel.

Figure 1. The major changes in information flow from the creators to readers/users when LLMs started serving as providers of content… [PITH_FULL_IMAGE:figures/full_fig_p006_1.png]
Figure 2. The human-centric attribution framework is grounded in case-specific stakeholder negotiations, which explicate and balance… [PITH_FULL_IMAGE:figures/full_fig_p009_2.png]
Figure 3. Moonshot: human-centric data attribution for LLM-assisted creative writing. We show how this process could look in a… [PITH_FULL_IMAGE:figures/full_fig_p013_3.png]
read the original abstract

In the current Large Language Model (LLM) ecosystem, creators have little agency over how their data is used, and LLM users may find themselves unknowingly plagiarizing existing sources. Attribution of LLM-generated text to LLM input data could help with these challenges, but so far we have more questions than answers: what elements of LLM outputs require attribution, what goals should it serve, how should it be implemented? We contribute a human-centric data attribution framework, which situates the attribution problem within the broader data economy. Specific use cases for attribution, such as creative writing assistance or fact-checking, can be specified via a set of parameters (including stakeholder objectives and implementation criteria). These criteria are up for negotiation by the relevant stakeholder groups: creators, LLM users, and their intermediaries (publishers, platforms, AI companies). The outcome of domain-specific negotiations can be implemented and tested for whether the stakeholder goals are achieved. The proposed approach provides a bridge between methodological NLP work on data attribution, governance work on policy interventions, and economic analysis of creator incentives for a sustainable equilibrium in the data economy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a human-centric data attribution framework for LLMs that situates attribution within the data economy. Use cases (e.g., creative writing or fact-checking) are specified via parameters capturing stakeholder objectives and implementation criteria; these parameters are negotiated among creators, users, and intermediaries (publishers, platforms, AI companies); the negotiated outcome is then implemented and tested against the original goals. The framework is positioned as a bridge linking NLP attribution methods, governance/policy interventions, and economic analysis of creator incentives.

Significance. If the framework can be operationalized with concrete mechanisms, it would offer a structured interdisciplinary lens for addressing attribution, potentially informing policy and technical standards that balance creator rights, user needs, and system performance in the LLM data economy.

major comments (2)
  1. [Framework description] The manuscript describes a sequence of 'specify use case, negotiate parameters, implement, test' but supplies no protocol or example for resolving objective conflicts (e.g., creator demands for verbatim provenance versus user demands for low-latency generation). This gap is load-bearing for the central bridging claim.
  2. [Implementation and testing phase] No explicit mapping is given from negotiated criteria to existing attribution techniques such as influence functions, data provenance tracking, or membership inference. Without this, the claimed integration with methodological NLP work remains an assertion rather than a demonstrated pathway.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a brief statement that the contribution is a conceptual proposal without empirical validation or code artifacts, to align reader expectations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We address the major comments below and describe the revisions we intend to incorporate.

read point-by-point responses
  1. Referee: [Framework description] The manuscript describes a sequence of 'specify use case, negotiate parameters, implement, test' but supplies no protocol or example for resolving objective conflicts (e.g., creator demands for verbatim provenance versus user demands for low-latency generation). This gap is load-bearing for the central bridging claim.

    Authors: We recognize that the manuscript presents the negotiation process at a conceptual level without providing a detailed protocol or example for resolving conflicts between stakeholder objectives. This is a valid observation. In the revised version, we will add a dedicated subsection with a worked example of a negotiation scenario. For instance, we will illustrate how parameters for provenance requirements and latency constraints can be balanced through a multi-stakeholder negotiation process, including potential trade-offs and resolution mechanisms. This addition will strengthen the bridging claim by demonstrating the framework's applicability. revision: yes

  2. Referee: [Implementation and testing phase] No explicit mapping is given from negotiated criteria to existing attribution techniques such as influence functions, data provenance tracking, or membership inference. Without this, the claimed integration with methodological NLP work remains an assertion rather than a demonstrated pathway.

    Authors: We agree that an explicit mapping would enhance the manuscript's demonstration of integration with NLP methods. We will revise the manuscript to include a new table that maps sample negotiated criteria (such as 'high accuracy in source attribution' or 'minimal computational overhead') to relevant techniques like influence functions, data provenance tracking, and membership inference attacks, supported by citations to existing literature. This will provide a clearer pathway from the negotiated outcomes to implementation and testing. revision: yes
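The table the simulated authors promise does not exist yet; a sketch of what such a criteria-to-technique mapping could look like follows. The entries are illustrative only: the criterion strings come from the rebuttal's own examples, and the technique names refer to the general attribution literature the referee cites, not to any implementation in the paper:

```python
# Hypothetical mapping from negotiated criteria to candidate NLP techniques.
CRITERIA_TO_TECHNIQUES = {
    "high accuracy in source attribution": [
        "influence functions",
        "data provenance tracking",
    ],
    "minimal computational overhead": [
        "similarity search over training data",
    ],
    "detect whether data was used in training": [
        "membership inference",
    ],
}

def candidate_techniques(criteria: list[str]) -> list[str]:
    """Deduplicated, order-preserving union of techniques for a criteria set."""
    out: list[str] = []
    for criterion in criteria:
        for technique in CRITERIA_TO_TECHNIQUES.get(criterion, []):
            if technique not in out:
                out.append(technique)
    return out

print(candidate_techniques(["high accuracy in source attribution",
                            "minimal computational overhead"]))
```

Even in this toy form, the lookup shows why the referee's request matters: without an agreed mapping, a negotiated criterion gives implementers no starting point in the technical literature.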

Circularity Check

0 steps flagged

No circularity: conceptual framework proposal without derivations or self-referential reductions

full rationale

The paper presents a high-level human-centric framework for data attribution in LLMs, describing a sequence of specifying use cases via stakeholder-negotiated parameters (objectives and criteria), followed by implementation and testing. No equations, fitted parameters, derivations, or mathematical claims exist in the text. The central assertion—that the framework bridges NLP attribution methods, policy governance, and economic incentives—is a forward-looking proposal rather than a result derived from or equivalent to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, ansatzes, or prior fitted results. The framework remains self-contained as a descriptive structure without reducing any prediction or claim to a tautological fit or renaming of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based solely on the abstract; the framework rests on domain assumptions about stakeholder negotiation feasibility and technical implementability of attribution criteria, with no free parameters or empirical fits identified.

axioms (2)
  • domain assumption Stakeholder groups can negotiate and agree on attribution parameters that achieve their objectives
    Central to the framework's operation as described in the abstract.
  • domain assumption Attribution criteria can be implemented and tested for goal achievement
    Assumed for the outcome of domain-specific negotiations.
invented entities (1)
  • Human-centric data attribution framework no independent evidence
    purpose: To structure the attribution problem via negotiable parameters
    New conceptual structure introduced to bridge NLP, governance, and economics.

pith-pipeline@v0.9.0 · 5497 in / 1347 out tokens · 43824 ms · 2026-05-16T02:30:01.727003+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    "We contribute a human-centric data attribution framework, which situates the attribution problem within the broader data economy. Specific use cases for attribution... can be specified via a set of parameters (including stakeholder objectives and implementation criteria). These criteria are up for negotiation by the relevant stakeholder groups..."

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    "At present, there are three broad groups of attribution criteria: similarity to existing content, causal influence on the model, and whether the data was used..."

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

196 extracted references · 196 canonical work pages · 5 internal anchors
