arxiv: 2601.02631 · v2 · pith:7W7UPEX4 · submitted 2026-01-06 · 💻 cs.CY

Copyright Laundering Through the AI Ouroboros: Adapting the 'Fruit of the Poisonous Tree' Doctrine to Recursive AI Training

Anirban Mukherjee, Hannah Hanwen Chang

Pith reviewed 2026-05-16 17:51 UTC · model grok-4.3

classification 💻 cs.CY

keywords: copyright infringement, AI training, fruit of the poisonous tree, recursive AI, synthetic data, evidentiary standards, model derivation

The pith

If a foundational AI model's training is infringing, later models derived from its outputs carry a rebuttable presumption of taint.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts the fruit of the poisonous tree doctrine to multi-generational AI training that uses recursive synthetic data. When an early model is ruled to have infringed copyright through unlawful sourcing or non-transformative use, successor models principally built from its outputs or distilled weights are presumed tainted. This shifts the burden onto downstream developers to prove an independent lawful lineage or a curative rebuild. Readers would care because recursive pipelines diffuse original copyrighted material into statistical abstractions, creating an evidentiary blind spot that defeats conventional access-plus-substantial-similarity proof.

Core claim

The paper develops an AI-FOPT standard: if a foundational AI model's training is adjudged infringing, then subsequent AI models principally derived from the foundational model's outputs or distilled weights carry a rebuttable presumption of taint. The burden shifts to downstream developers to demonstrate a verifiably independent and lawfully sourced lineage or a curative rebuild. Absent such proof, commercial deployment of the tainted models and their outputs is actionable, while fair-use analysis remains confined to the initial ingestion stage.
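The order of the steps in the standard can be made explicit with a small sketch. It is only an illustration of the trigger, presumption, and rebuttal sequence described above; the names `Facts`, `Outcome`, and `ai_fopt` are hypothetical, and each boolean stands in for a finding that a court would have to reach on the evidence.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    NOT_TRIGGERED = "standard not triggered"
    NO_PRESUMPTION = "no presumption of taint"
    REBUTTED = "presumption rebutted"
    ACTIONABLE = "presumption unrebutted; commercial deployment actionable"


@dataclass(frozen=True)
class Facts:
    """Hypothetical container for the findings the standard depends on."""
    foundation_adjudged_infringing: bool  # trigger
    principally_derived: bool             # successor built mainly from tainted outputs or weights
    independent_lawful_lineage: bool      # rebuttal path 1
    curative_rebuild: bool                # rebuttal path 2, e.g. verified unlearning


def ai_fopt(facts: Facts) -> Outcome:
    # Step 1: nothing happens unless the foundational model was adjudged infringing.
    if not facts.foundation_adjudged_infringing:
        return Outcome.NOT_TRIGGERED
    # Step 2: the presumption attaches only to principally derived successors.
    if not facts.principally_derived:
        return Outcome.NO_PRESUMPTION
    # Step 3: the downstream developer may rebut by either path.
    if facts.independent_lawful_lineage or facts.curative_rebuild:
        return Outcome.REBUTTED
    # Step 4: absent rebuttal, deployment is actionable.
    return Outcome.ACTIONABLE
```

Fair use does not appear as an input because, on the paper's account, it is assessed once at the initial ingestion stage and is already reflected in the trigger.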

What carries the argument

The AI-FOPT standard, which imposes a rebuttable presumption of taint on models principally derived from an infringing foundational model via recursive synthetic-data pipelines.

If this is right

Downstream developers must affirmatively prove independent lineage or curative unlearning to avoid liability.
Commercial deployment of models lacking such proof becomes actionable.
Fair-use analysis is preserved at the initial training stage rather than re-litigated at each generation.
The approach targets developers who control provenance records, making the rule administrable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers may invest in standardized provenance logging and third-party audit protocols to meet rebuttal thresholds.
Courts could apply analogous presumptions to chained AI systems involving privacy or trade-secret claims.
Practical tests will emerge around whether technical unlearning verification scales without high error rates.
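As one illustration of what standardized provenance logging might look like, the sketch below keeps a tamper-evident, hash-chained log of training events. This is an assumption about a possible design, not something the paper specifies; the record fields and the function names `append` and `verify` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, record: dict) -> None:
    """Add a record, chaining it to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "hash": entry_hash(prev, record)})


def verify(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, {"step": "ingest", "source": "licensed_corpus_v1", "license": "commercial"})
append(log, {"step": "train", "base_model": "none", "data": ["licensed_corpus_v1"]})
```

A chain like this only shows that the log was not altered after the fact. Whether the recorded events are complete and truthful would still be a question for the third-party audit.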

Load-bearing premise

Courts can reliably identify when a later model is principally derived from tainted outputs and that verifiable unlearning can be implemented and audited at scale without excessive cost or false negatives.
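To make the first half of this premise concrete, here is a toy sketch of one way "principally derived" could be operationalized from training records. Everything in it is an assumption made for illustration: the paper gives no metric, the 0.5 threshold is arbitrary, the function `tainted_share` is hypothetical, and the sketch assumes the lineage records form an acyclic graph with known training-mix shares.

```python
def tainted_share(model, sources, adjudged, _cache=None):
    """Fraction of a model's training mix that traces back to adjudged models.

    sources maps each derived model to {parent_or_dataset: share_of_training_mix}.
    Anything absent from sources is treated as an independent, untainted dataset.
    """
    if _cache is None:
        _cache = {}
    if model in adjudged:
        return 1.0
    if model not in sources:
        return 0.0
    if model not in _cache:
        _cache[model] = sum(
            share * tainted_share(parent, sources, adjudged, _cache)
            for parent, share in sources[model].items()
        )
    return _cache[model]


# Hypothetical lineage: gen2 is trained mostly on gen1 outputs, gen3 mostly on gen2 outputs.
sources = {
    "gen2": {"gen1": 0.7, "licensed_corpus": 0.3},
    "gen3": {"gen2": 0.8, "public_domain_corpus": 0.2},
}
adjudged = {"gen1"}

THRESHOLD = 0.5  # arbitrary illustrative cut-off, not taken from the paper
principally_derived = {
    name: tainted_share(name, sources, adjudged) > THRESHOLD for name in sources
}
```

Treating taint as diluting in proportion to the training mix is itself a contestable modeling choice. A court could just as plausibly treat any model initialized or distilled from a tainted predecessor as fully tainted, whatever the mix.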

What would settle it

A copyright suit in which a downstream developer supplies auditable logs of independent training data or unlearning steps yet the court still imposes liability solely on the presumption, or conversely a case where the presumption is rebutted but later evidence reveals persistent copyrighted influence in outputs.

Original abstract

Copyright enforcement rests on an evidentiary bargain: a plaintiff must show both the defendant's access to the work and substantial similarity in the challenged output. That bargain comes under strain when AI systems are trained through multi-generational pipelines with recursive synthetic data. As successive models are tuned on the outputs of their predecessors, any copyrighted material absorbed by an early model is diffused into deeper statistical abstractions. The result is an evidentiary blind spot where overlaps that emerge look coincidental, while the chain of provenance is too attenuated to trace. These conditions are ripe for "copyright laundering"--the use of multi-generational synthetic pipelines, an "AI Ouroboros," to render traditional proof of infringement impracticable. This Article adapts the "fruit of the poisonous tree" (FOPT) principle to propose an AI-FOPT standard: if a foundational AI model's training is adjudged infringing (either for unlawful sourcing or for non-transformative ingestion that fails fair use), then subsequent AI models principally derived from the foundational model's outputs or distilled weights carry a rebuttable presumption of taint. The burden shifts to downstream developers--those who control the evidence of provenance--to restore the evidentiary bargain by affirmatively demonstrating a verifiably independent and lawfully sourced lineage or a curative rebuild, without displacing fair-use analysis at the initial ingestion stage. Absent such proof, commercial deployment of tainted models and their outputs is actionable. This Article develops the standard by specifying its trigger, presumption, and concrete rebuttal paths (e.g., independent lineage or verifiable unlearning); addresses counterarguments concerning chilling innovation and fair use; and demonstrates why this lineage-focused approach is both administrable and essential.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper proposes shifting the burden to downstream AI developers via a rebuttable presumption modeled on fruit of the poisonous tree when models are trained on synthetic outputs from an earlier infringing model.

The core idea is straightforward: once a foundational model is ruled infringing, any later model that is principally derived from its outputs or distilled weights starts with a presumption of taint. Downstream developers then have to prove either an independent lawful lineage or that they performed a curative rebuild such as verifiable unlearning. The paper spells out the trigger, the presumption, and two main rebuttal routes, and it walks through why this keeps the traditional access-plus-substantial-similarity test from becoming useless in recursive pipelines. That mapping is new; prior copyright-AI work has not applied the doctrine this way to multi-generational synthetic data. The argument stays internally consistent and engages the obvious counterpoints about innovation chill and fair use at the first stage. It is a clean normative proposal that tries to restore an evidentiary balance without banning synthetic data outright.

The main limitation is practical. The paper gives no operational test for “principally derived”: no similarity threshold on outputs, no distance metric on weights, no bound on false negatives for unlearning audits. Without those, courts or auditors would have no workable way to apply or contest the presumption. It also offers no data on how frequently recursive pipelines actually diffuse protected material, so the size of the problem is asserted rather than shown.

This is for legal academics and policy people who want a concrete doctrinal fix for AI copyright enforcement. A reader looking for a structured way to think about provenance in synthetic-data chains will find it useful. The piece is coherent enough and engages the literature directly, so it deserves a serious referee even though the implementation details will need tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims that recursive synthetic-data pipelines in AI training create an 'AI Ouroboros' enabling copyright laundering, where copyrighted material absorbed by an early model diffuses into later statistical abstractions and becomes untraceable under traditional access-plus-substantial-similarity tests. It proposes an AI-FOPT doctrine: when a foundational model is adjudged infringing (unlawful sourcing or non-transformative fair-use failure), any downstream model 'principally derived' from its outputs or distilled weights carries a rebuttable presumption of taint. The burden shifts to the downstream developer to prove independent lawful lineage or a curative rebuild (e.g., verifiable unlearning). The paper specifies the trigger, presumption, rebuttal paths, counters innovation-chilling and fair-use objections, and asserts the framework is both administrable and necessary to restore the evidentiary bargain.

Significance. If the proposed standard holds, the manuscript supplies a coherent normative framework that adapts established fruit-of-the-poisonous-tree principles to multi-generational AI pipelines, offering a lineage-focused mechanism to address evidentiary blind spots created by synthetic data. Its strength is the explicit mapping of trigger, presumption, and rebuttal paths onto existing doctrine without displacing initial fair-use analysis. The absence of empirical data on laundering frequency or rebuttal feasibility, however, leaves the practical significance dependent on future technical and judicial validation.

major comments (2)

§ on specification of the AI-FOPT standard (trigger, presumption, and rebuttal paths): the central claim that the rebuttable presumption is 'administrable' rests on courts' ability to determine when a model is 'principally derived' from tainted outputs or weights, yet the text supplies no operational criteria: no similarity threshold on output distributions, no distance metric on distilled weights, and no false-negative bound for audits. This renders the evidentiary shift non-justiciable in practice and is load-bearing for the proposal.
§ on concrete rebuttal paths (verifiable unlearning and independent lineage): the discussion lists curative rebuilds as a rebuttal but provides no protocols, cost models, or scalability analysis for large-model verification. Without these, the burden shift cannot be implemented or contested, directly undermining the assertion that the standard restores the evidentiary bargain without excessive cost.
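For a sense of what a bound for audits could look like, the sketch below uses a standard statistical result: if n independent probes of a model elicit no protected content, the exact one-sided upper confidence bound on the true elicitation rate is 1 - alpha^(1/n). Applying it to unlearning audits is an assumption of this illustration, not a method from the paper, and it presumes that probes are drawn independently from the relevant prompt distribution and that the detector never misses a leak. The function names are hypothetical.

```python
import math


def upper_bound_zero_hits(n_probes: int, alpha: float = 0.05) -> float:
    """Upper (1 - alpha) confidence bound on the leak rate after n clean probes."""
    if n_probes <= 0:
        raise ValueError("n_probes must be positive")
    return 1.0 - alpha ** (1.0 / n_probes)


def probes_needed(max_rate: float, alpha: float = 0.05) -> int:
    """Smallest number of clean probes that pushes the bound down to max_rate."""
    if not 0.0 < max_rate < 1.0:
        raise ValueError("max_rate must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - max_rate))


bound_300 = upper_bound_zero_hits(300)  # a little under 1 percent
n_for_1pct = probes_needed(0.01)        # 299 clean probes
```

The cost implication is direct: the number of clean probes grows roughly as 3 divided by the target rate, so certifying a rate below one in a million at 95 percent confidence would take about three million clean probes under these assumptions.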

minor comments (1)

[Abstract] The abstract and introduction could more clearly flag the lack of empirical validation of laundering prevalence or rebuttal feasibility to calibrate reader expectations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications on the scope and nature of our proposed AI-FOPT standard. We believe no major revisions to the core argument are required, as the paper offers a doctrinal framework rather than a technical blueprint.

Referee: § on specification of the AI-FOPT standard (trigger, presumption, and rebuttal paths): the central claim that the rebuttable presumption is 'administrable' rests on courts' ability to determine when a model is 'principally derived' from tainted outputs or weights, yet the text supplies no operational criteria: no similarity threshold on output distributions, no distance metric on distilled weights, and no false-negative bound for audits. This renders the evidentiary shift non-justiciable in practice and is load-bearing for the proposal.

Authors: The manuscript proposes the AI-FOPT standard as a normative legal doctrine adapted from established principles in evidence and copyright law. Determinations of whether a model is 'principally derived' from tainted sources would rely on judicial assessment of available evidence, including training logs, model cards, and expert analysis, much like how courts evaluate substantial similarity or derivative works in traditional copyright cases. We intentionally avoid specifying technical thresholds or metrics, as these are matters for evidentiary development in litigation and technical standards bodies rather than fixed in the initial doctrinal proposal. This does not render the standard non-justiciable; presumptions in law often start without precise quantitative criteria and are refined through case law. We therefore maintain that the framework is administrable.

Revision: no

Referee: § on concrete rebuttal paths (verifiable unlearning and independent lineage): the discussion lists curative rebuilds as a rebuttal but provides no protocols, cost models, or scalability analysis for large-model verification. Without these, the burden shift cannot be implemented or contested, directly undermining the assertion that the standard restores the evidentiary bargain without excessive cost.

Authors: Concrete rebuttal paths such as verifiable unlearning and demonstration of independent lineage are outlined conceptually to show feasible mechanisms for shifting the burden back to the party best positioned to provide evidence. Detailed protocols, cost models, and scalability analyses pertain to the technical implementation of these methods, which is an active area of research in machine learning (e.g., machine unlearning techniques). The paper does not purport to supply engineering specifications but argues that the legal standard would encourage the development and adoption of such methods. We acknowledge that practical costs and feasibility will need to be evaluated in future work, but this does not undermine the proposal's restoration of the evidentiary bargain at the doctrinal level.

Revision: no

Circularity Check

0 steps flagged

No circularity: normative legal proposal without derivation chain

The paper advances a policy recommendation adapting the fruit-of-the-poisonous-tree doctrine to AI training pipelines. It specifies a trigger (infringing foundational model), a rebuttable presumption for downstream models, and rebuttal mechanisms (independent lineage or verifiable unlearning) as a normative framework. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided text. The argument rests on existing legal principles and policy considerations rather than reducing any result to its own inputs by construction. The central claim is therefore self-contained as a proposal and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is a normative legal proposal; it introduces no fitted parameters, no new physical or mathematical entities, and relies only on standard legal axioms about evidentiary burdens and doctrinal adaptation.

axioms (1)

domain assumption Copyright enforcement requires an evidentiary bargain of access plus substantial similarity that can be adapted to new technologies
Invoked in the opening paragraph as the foundation for identifying the laundering problem

pith-pipeline@v0.9.0 · 5608 in / 1277 out tokens · 43838 ms · 2026-05-16T17:51:08.670938+00:00 · methodology
