pith. machine review for the scientific record.

arxiv: 2602.10995 · v2 · submitted 2026-02-11 · 💻 cs.CY

Recognition: 2 theorem links


A Human-Centric Framework for Data Attribution in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:30 UTC · model grok-4.3

classification 💻 cs.CY
keywords data attribution · large language models · data economy · stakeholder negotiation · LLM governance · creator incentives · human-centric AI

The pith

A framework lets creators, users and intermediaries negotiate data attribution parameters for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM systems leave creators without control over their data and expose users to unwitting plagiarism. The paper proposes a human-centric framework that embeds attribution decisions inside the larger data economy. Use cases such as creative writing assistance or fact-checking are defined by adjustable parameters that capture stakeholder objectives and implementation criteria. These parameters are negotiated among creators, LLM users, and intermediaries, after which the chosen criteria are implemented and tested against the original goals. The approach is intended to connect existing NLP attribution methods with policy governance and economic analysis of creator incentives.

Core claim

The proposed human-centric data attribution framework situates the attribution problem within the broader data economy. Specific use cases for attribution, such as creative writing assistance or fact-checking, can be specified via a set of parameters including stakeholder objectives and implementation criteria. These criteria are up for negotiation by the relevant stakeholder groups: creators, LLM users, and their intermediaries. The outcome of domain-specific negotiations can be implemented and tested for whether the stakeholder goals are achieved.
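The paper describes the parameter set and the negotiate-implement-test cycle only at a conceptual level. As a reading aid, here is a minimal sketch of how one negotiated use case might be encoded; all field and class names are hypothetical, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class AttributionUseCase:
    """Hypothetical encoding of one negotiated attribution use case."""
    name: str                           # e.g. "creative writing assistance"
    stakeholders: list[str]             # creators, LLM users, intermediaries
    objectives: dict[str, str]          # stakeholder -> stated goal
    implementation_criteria: list[str]  # testable criteria agreed in negotiation

    def goals_met(self, test_results: dict[str, bool]) -> bool:
        """Final step of the cycle: every negotiated criterion must pass its test."""
        return all(test_results.get(c, False) for c in self.implementation_criteria)

# Illustrative instance for the fact-checking use case named in the abstract.
case = AttributionUseCase(
    name="fact-checking",
    stakeholders=["creators", "LLM users", "publishers"],
    objectives={"creators": "source credit", "LLM users": "verifiable citations"},
    implementation_criteria=["output cites a source", "cited source supports claim"],
)
print(case.goals_met({"output cites a source": True,
                      "cited source supports claim": True}))  # → True
```

The sketch makes the framework's order of operations concrete: objectives and criteria are fixed by negotiation first, and only then does implementation get scored against them.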

What carries the argument

The negotiable parameter set of stakeholder objectives and implementation criteria that defines and tests domain-specific attribution use cases.

If this is right

  • Attribution rules can be customized for particular applications such as creative writing or fact-checking.
  • Negotiations can align incentives across creators, users, and platforms in the data economy.
  • Methodological NLP techniques can be applied within governance structures defined by the negotiated criteria.
  • Testing outcomes can indicate whether a sustainable equilibrium for data creators is reached.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework implies the need for new mechanisms or institutions to host and enforce the negotiations.
  • Pilot implementations could be run on existing open models to measure outcomes such as creator compensation and user citation rates.
  • Success would supply a concrete template that regulators could adopt or adapt for broader AI data rules.

Load-bearing premise

That creators, LLM users, and intermediaries can reach agreements on parameters and that the resulting criteria can be implemented and tested to meet their stated objectives.

What would settle it

Consistent failure of stakeholder negotiations to produce criteria that can be implemented in real LLM systems, or testing that shows the implemented criteria do not advance the stated goals of any stakeholder group.

Figures

Figures reproduced from arXiv: 2602.10995 by Amelie Wührl, Anna Rogers, Kyle Lo, Mattes Ruckdeschel.

Figure 1. The major changes in information flow from the creators to readers/users when LLMs started serving as providers of content… [PITH_FULL_IMAGE:figures/full_fig_p006_1.png]
Figure 2. The human-centric attribution framework is grounded in case-specific stakeholder negotiations, which explicate and balance… [PITH_FULL_IMAGE:figures/full_fig_p009_2.png]
Figure 3. Moonshot: human-centric data attribution for LLM-assisted creative writing. We show how this process could look in a… [PITH_FULL_IMAGE:figures/full_fig_p013_3.png]
read the original abstract

In the current Large Language Model (LLM) ecosystem, creators have little agency over how their data is used, and LLM users may find themselves unknowingly plagiarizing existing sources. Attribution of LLM-generated text to LLM input data could help with these challenges, but so far we have more questions than answers: what elements of LLM outputs require attribution, what goals should it serve, how should it be implemented? We contribute a human-centric data attribution framework, which situates the attribution problem within the broader data economy. Specific use cases for attribution, such as creative writing assistance or fact-checking, can be specified via a set of parameters (including stakeholder objectives and implementation criteria). These criteria are up for negotiation by the relevant stakeholder groups: creators, LLM users, and their intermediaries (publishers, platforms, AI companies). The outcome of domain-specific negotiations can be implemented and tested for whether the stakeholder goals are achieved. The proposed approach provides a bridge between methodological NLP work on data attribution, governance work on policy interventions, and economic analysis of creator incentives for a sustainable equilibrium in the data economy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a human-centric data attribution framework for LLMs that situates attribution within the data economy. Use cases (e.g., creative writing or fact-checking) are specified via parameters capturing stakeholder objectives and implementation criteria; these parameters are negotiated among creators, users, and intermediaries (publishers, platforms, AI companies); the negotiated outcome is then implemented and tested against the original goals. The framework is positioned as a bridge linking NLP attribution methods, governance/policy interventions, and economic analysis of creator incentives.

Significance. If the framework can be operationalized with concrete mechanisms, it would offer a structured interdisciplinary lens for addressing attribution, potentially informing policy and technical standards that balance creator rights, user needs, and system performance in the LLM data economy.

major comments (2)
  1. [Framework description] The manuscript describes a sequence of 'specify use case, negotiate parameters, implement, test' but supplies no protocol or example for resolving objective conflicts (e.g., creator demands for verbatim provenance versus user demands for low-latency generation). This gap is load-bearing for the central bridging claim.
  2. [Implementation and testing phase] No explicit mapping is given from negotiated criteria to existing attribution techniques such as influence functions, data provenance tracking, or membership inference. Without this, the claimed integration with methodological NLP work remains an assertion rather than a demonstrated pathway.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a brief statement that the contribution is a conceptual proposal without empirical validation or code artifacts, to align reader expectations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We address the major comments below and describe the revisions we intend to incorporate.

read point-by-point responses
  1. Referee: [Framework description] The manuscript describes a sequence of 'specify use case, negotiate parameters, implement, test' but supplies no protocol or example for resolving objective conflicts (e.g., creator demands for verbatim provenance versus user demands for low-latency generation). This gap is load-bearing for the central bridging claim.

    Authors: We recognize that the manuscript presents the negotiation process at a conceptual level without providing a detailed protocol or example for resolving conflicts between stakeholder objectives. This is a valid observation. In the revised version, we will add a dedicated subsection with a worked example of a negotiation scenario. For instance, we will illustrate how parameters for provenance requirements and latency constraints can be balanced through a multi-stakeholder negotiation process, including potential trade-offs and resolution mechanisms. This addition will strengthen the bridging claim by demonstrating the framework's applicability. revision: yes

  2. Referee: [Implementation and testing phase] No explicit mapping is given from negotiated criteria to existing attribution techniques such as influence functions, data provenance tracking, or membership inference. Without this, the claimed integration with methodological NLP work remains an assertion rather than a demonstrated pathway.

    Authors: We agree that an explicit mapping would enhance the manuscript's demonstration of integration with NLP methods. We will revise the manuscript to include a new table that maps sample negotiated criteria (such as 'high accuracy in source attribution' or 'minimal computational overhead') to relevant techniques like influence functions, data provenance tracking, and membership inference attacks, supported by citations to existing literature. This will provide a clearer pathway from the negotiated outcomes to implementation and testing. revision: yes
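The table the simulated authors promise does not exist yet; a sketch of what such a criteria-to-technique mapping could look like follows. The entries are illustrative only: the criterion strings come from the rebuttal's own examples, and the technique names refer to the general attribution literature the referee cites, not to any implementation in the paper:

```python
# Hypothetical mapping from negotiated criteria to candidate NLP techniques.
CRITERIA_TO_TECHNIQUES = {
    "high accuracy in source attribution": [
        "influence functions",
        "data provenance tracking",
    ],
    "minimal computational overhead": [
        "similarity search over training data",
    ],
    "detect whether data was used in training": [
        "membership inference",
    ],
}

def candidate_techniques(criteria: list[str]) -> list[str]:
    """Deduplicated, order-preserving union of techniques for a criteria set."""
    out: list[str] = []
    for criterion in criteria:
        for technique in CRITERIA_TO_TECHNIQUES.get(criterion, []):
            if technique not in out:
                out.append(technique)
    return out

print(candidate_techniques(["high accuracy in source attribution",
                            "minimal computational overhead"]))
```

Even in this toy form, the lookup shows why the referee's request matters: without an agreed mapping, a negotiated criterion gives implementers no starting point in the technical literature.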

Circularity Check

0 steps flagged

No circularity: conceptual framework proposal without derivations or self-referential reductions

full rationale

The paper presents a high-level human-centric framework for data attribution in LLMs, describing a sequence of specifying use cases via stakeholder-negotiated parameters (objectives and criteria), followed by implementation and testing. No equations, fitted parameters, derivations, or mathematical claims exist in the text. The central assertion—that the framework bridges NLP attribution methods, policy governance, and economic incentives—is a forward-looking proposal rather than a result derived from or equivalent to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, ansatzes, or prior fitted results. The framework remains self-contained as a descriptive structure without reducing any prediction or claim to a tautological fit or renaming of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based solely on the abstract; the framework rests on domain assumptions about stakeholder negotiation feasibility and technical implementability of attribution criteria, with no free parameters or empirical fits identified.

axioms (2)
  • domain assumption Stakeholder groups can negotiate and agree on attribution parameters that achieve their objectives
    Central to the framework's operation as described in the abstract.
  • domain assumption Attribution criteria can be implemented and tested for goal achievement
    Assumed for the outcome of domain-specific negotiations.
invented entities (1)
  • Human-centric data attribution framework no independent evidence
    purpose: To structure the attribution problem via negotiable parameters
    New conceptual structure introduced to bridge NLP, governance, and economics.

pith-pipeline@v0.9.0 · 5497 in / 1347 out tokens · 43824 ms · 2026-05-16T02:30:01.727003+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    "We contribute a human-centric data attribution framework, which situates the attribution problem within the broader data economy. Specific use cases for attribution... can be specified via a set of parameters (including stakeholder objectives and implementation criteria). These criteria are up for negotiation by the relevant stakeholder groups..."

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    "At present, there are three broad groups of attribution criteria: similarity to existing content, causal influence on the model, and whether the data was used..."

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

196 extracted references · 196 canonical work pages · 5 internal anchors
