A Human-Centric Framework for Data Attribution in Large Language Models
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-16 02:30 UTC · model grok-4.3
The pith
A framework lets creators, users, and intermediaries negotiate data attribution parameters for LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed human-centric data attribution framework situates the attribution problem within the broader data economy. Specific use cases for attribution, such as creative writing assistance or fact-checking, can be specified via a set of parameters including stakeholder objectives and implementation criteria. These criteria are up for negotiation by the relevant stakeholder groups: creators, LLM users, and their intermediaries. The outcome of domain-specific negotiations can be implemented and tested for whether the stakeholder goals are achieved.
What carries the argument
The negotiable parameter set (stakeholder objectives plus implementation criteria) that specifies domain-specific attribution use cases and supplies the yardstick for testing them.
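The parameter set described above can be sketched as a small data structure. This is a hypothetical illustration, not the paper's notation: the field names, the `fact_checking` example, and the numeric thresholds are all invented here to make the framework's shape concrete.

```python
from dataclasses import dataclass

# Hypothetical sketch of the paper's "parameter set" for one attribution use case.
# Field names and example values are illustrative assumptions, not the paper's.
@dataclass
class AttributionUseCase:
    domain: str                                 # e.g. "creative-writing", "fact-checking"
    stakeholders: list[str]                     # creators, LLM users, intermediaries
    objectives: dict[str, str]                  # per-stakeholder goal statements
    implementation_criteria: dict[str, float]   # measurable thresholds to test against

fact_checking = AttributionUseCase(
    domain="fact-checking",
    stakeholders=["creators", "llm-users", "publishers"],
    objectives={
        "creators": "receive credit and compensation for sourced claims",
        "llm-users": "see verifiable citations for factual statements",
        "publishers": "drive traffic back to original articles",
    },
    implementation_criteria={"min_citation_rate": 0.9, "max_latency_overhead_s": 0.5},
)
```

Under the framework, the `objectives` field is what stakeholders negotiate over, while `implementation_criteria` is what the later testing phase measures against.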
If this is right
- Attribution rules can be customized for particular applications such as creative writing or fact-checking.
- Negotiations can align incentives across creators, users, and platforms in the data economy.
- Methodological NLP techniques can be applied within governance structures defined by the negotiated criteria.
- Testing outcomes can indicate whether a sustainable equilibrium for data creators is reached.
Where Pith is reading between the lines
- The framework implies the need for new mechanisms or institutions to host and enforce the negotiations.
- Pilot implementations could be run on existing open models to measure outcomes such as creator compensation and user citation rates.
- Success would supply a concrete template that regulators could adopt or adapt for broader AI data rules.
Load-bearing premise
That creators, LLM users, and intermediaries can reach agreements on parameters and that the resulting criteria can be implemented and tested to meet their stated objectives.
What would settle it
Consistent failure of stakeholder negotiations to produce criteria that can be implemented in real LLM systems, or test results showing that the negotiated criteria do not advance the stated goals of any stakeholder group.
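The "implement and test" step that would settle this can be sketched as a minimal check of measured outcomes against negotiated thresholds. The metric names, the `min_`/`max_` convention, and the numbers are assumptions made for illustration.

```python
# Minimal sketch of the testing phase: compare measured outcomes against
# negotiated thresholds. Criteria prefixed "min_" are lower bounds; "max_"
# are upper bounds. Metric names and values are hypothetical.
def criteria_met(measured: dict[str, float], negotiated: dict[str, float]) -> dict[str, bool]:
    """Report, per criterion, whether the measured outcome meets the negotiated bound."""
    results = {}
    for name, threshold in negotiated.items():
        value = measured[name]
        results[name] = value >= threshold if name.startswith("min_") else value <= threshold
    return results

negotiated = {"min_citation_rate": 0.9, "max_latency_overhead_s": 0.5}
measured = {"min_citation_rate": 0.93, "max_latency_overhead_s": 0.7}
print(criteria_met(measured, negotiated))
# -> {'min_citation_rate': True, 'max_latency_overhead_s': False}
```

A mixed result like this one (citation rate passes, latency fails) is exactly the signal that would send stakeholders back into another round of negotiation rather than settling the framework outright.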
Original abstract
In the current Large Language Model (LLM) ecosystem, creators have little agency over how their data is used, and LLM users may find themselves unknowingly plagiarizing existing sources. Attribution of LLM-generated text to LLM input data could help with these challenges, but so far we have more questions than answers: what elements of LLM outputs require attribution, what goals should it serve, how should it be implemented? We contribute a human-centric data attribution framework, which situates the attribution problem within the broader data economy. Specific use cases for attribution, such as creative writing assistance or fact-checking, can be specified via a set of parameters (including stakeholder objectives and implementation criteria). These criteria are up for negotiation by the relevant stakeholder groups: creators, LLM users, and their intermediaries (publishers, platforms, AI companies). The outcome of domain-specific negotiations can be implemented and tested for whether the stakeholder goals are achieved. The proposed approach provides a bridge between methodological NLP work on data attribution, governance work on policy interventions, and economic analysis of creator incentives for a sustainable equilibrium in the data economy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a human-centric data attribution framework for LLMs that situates attribution within the data economy. Use cases (e.g., creative writing or fact-checking) are specified via parameters capturing stakeholder objectives and implementation criteria; these parameters are negotiated among creators, users, and intermediaries (publishers, platforms, AI companies); the negotiated outcome is then implemented and tested against the original goals. The framework is positioned as a bridge linking NLP attribution methods, governance/policy interventions, and economic analysis of creator incentives.
Significance. If the framework can be operationalized with concrete mechanisms, it would offer a structured interdisciplinary lens for addressing attribution, potentially informing policy and technical standards that balance creator rights, user needs, and system performance in the LLM data economy.
major comments (2)
- [Framework description] The manuscript describes a sequence of 'specify use case, negotiate parameters, implement, test' but supplies no protocol or example for resolving objective conflicts (e.g., creator demands for verbatim provenance versus user demands for low-latency generation). This gap is load-bearing for the central bridging claim.
- [Implementation and testing phase] No explicit mapping is given from negotiated criteria to existing attribution techniques such as influence functions, data provenance tracking, or membership inference. Without this, the claimed integration with methodological NLP work remains an assertion rather than a demonstrated pathway.
minor comments (1)
- [Abstract] The abstract and introduction would benefit from a brief statement that the contribution is a conceptual proposal without empirical validation or code artifacts, to align reader expectations.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback on our manuscript. We address the major comments below and describe the revisions we intend to incorporate.
Point-by-point responses
Referee: [Framework description] The manuscript describes a sequence of 'specify use case, negotiate parameters, implement, test' but supplies no protocol or example for resolving objective conflicts (e.g., creator demands for verbatim provenance versus user demands for low-latency generation). This gap is load-bearing for the central bridging claim.
Authors: We recognize that the manuscript presents the negotiation process at a conceptual level without providing a detailed protocol or example for resolving conflicts between stakeholder objectives. This is a valid observation. In the revised version, we will add a dedicated subsection with a worked example of a negotiation scenario. For instance, we will illustrate how parameters for provenance requirements and latency constraints can be balanced through a multi-stakeholder negotiation process, including potential trade-offs and resolution mechanisms. This addition will strengthen the bridging claim by demonstrating the framework's applicability.
Revision: yes
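One way the promised worked example could resolve the provenance-versus-latency conflict is a simple maximin rule over candidate configurations: pick the configuration whose worst-off stakeholder fares best. The configurations, scores, and weights below are invented for illustration; the paper does not specify a resolution mechanism.

```python
# Hypothetical illustration of the provenance-vs-latency trade-off the referee
# raises. Candidate attribution configurations are scored on two axes (higher
# is better); each stakeholder group weights the axes differently. All numbers
# are invented for this sketch.
candidates = {
    "verbatim-provenance": {"provenance": 1.0, "latency": 0.2},  # full source spans, slow
    "doc-level-citation":  {"provenance": 0.6, "latency": 0.8},  # source links only, fast
    "no-attribution":      {"provenance": 0.0, "latency": 1.0},
}
weights = {
    "creators":  {"provenance": 0.9, "latency": 0.1},  # creators prize provenance
    "llm-users": {"provenance": 0.3, "latency": 0.7},  # users prize responsiveness
}

def group_score(config: dict[str, float], w: dict[str, float]) -> float:
    """Weighted satisfaction of one stakeholder group with one configuration."""
    return sum(w[axis] * config[axis] for axis in w)

# Maximin rule: choose the configuration whose worst-off group scores highest.
best = max(candidates, key=lambda c: min(group_score(candidates[c], w) for w in weights.values()))
print(best)  # -> doc-level-citation
```

Under these invented weights the compromise configuration wins, which is the kind of concrete negotiation outcome the rebuttal's worked example would need to exhibit.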
Referee: [Implementation and testing phase] No explicit mapping is given from negotiated criteria to existing attribution techniques such as influence functions, data provenance tracking, or membership inference. Without this, the claimed integration with methodological NLP work remains an assertion rather than a demonstrated pathway.
Authors: We agree that an explicit mapping would enhance the manuscript's demonstration of integration with NLP methods. We will revise the manuscript to include a new table that maps sample negotiated criteria (such as 'high accuracy in source attribution' or 'minimal computational overhead') to relevant techniques like influence functions, data provenance tracking, and membership inference attacks, supported by citations to existing literature. This will provide a clearer pathway from the negotiated outcomes to implementation and testing.
Revision: yes
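The mapping table the rebuttal proposes can be sketched as a lookup from negotiated criteria to candidate techniques. The pairings below are plausible assumptions assembled from the review text, not the paper's actual table.

```python
# Sketch of the proposed criteria-to-techniques mapping. The pairings are
# illustrative assumptions, not the paper's table.
criteria_to_techniques = {
    "high accuracy in source attribution": ["influence functions", "data provenance tracking"],
    "minimal computational overhead":      ["data provenance tracking"],
    "detect whether data was used":        ["membership inference"],
}

def candidate_techniques(criteria: list[str]) -> set[str]:
    """Union of techniques that could implement the given negotiated criteria."""
    return {t for c in criteria for t in criteria_to_techniques.get(c, [])}

print(sorted(candidate_techniques([
    "high accuracy in source attribution",
    "detect whether data was used",
])))
# -> ['data provenance tracking', 'influence functions', 'membership inference']
```

Even this toy lookup makes the referee's point concrete: once criteria are named, the pathway from negotiation to implementation becomes a checkable artifact rather than an assertion.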
Circularity Check
No circularity: conceptual framework proposal without derivations or self-referential reductions
Full rationale
The paper presents a high-level human-centric framework for data attribution in LLMs, describing a sequence of specifying use cases via stakeholder-negotiated parameters (objectives and criteria), followed by implementation and testing. No equations, fitted parameters, derivations, or mathematical claims exist in the text. The central assertion—that the framework bridges NLP attribution methods, policy governance, and economic incentives—is a forward-looking proposal rather than a result derived from or equivalent to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, ansatzes, or prior fitted results. The framework remains self-contained as a descriptive structure without reducing any prediction or claim to a tautological fit or renaming of known patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Stakeholder groups can negotiate and agree on attribution parameters that achieve their objectives.
- Domain assumption: Attribution criteria can be implemented and tested for goal achievement.
invented entities (1)
- Human-centric data attribution framework (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  The relation between the cited Recognition theorem and this paper passage is unclear:
  "We contribute a human-centric data attribution framework, which situates the attribution problem within the broader data economy. Specific use cases for attribution... can be specified via a set of parameters (including stakeholder objectives and implementation criteria). These criteria are up for negotiation by the relevant stakeholder groups..."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  The relation between the cited Recognition theorem and this paper passage is unclear:
  "At present, there are three broad groups of attribution criteria: similarity to existing content, causal influence on the model, and whether the data was used..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.