pith. machine review for the scientific record.

arxiv: 2604.04892 · v1 · submitted 2026-04-06 · 💻 cs.LG

Recognition: no theorem link

Data Attribution in Adaptive Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 19:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords data attribution · adaptive learning · reinforcement learning · online learning · bandits · interventional inference · causal attribution · logged data

The pith

Occurrence-level attribution in adaptive learning requires a conditional interventional target that replay data cannot recover except in specific structural classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a principled way to attribute individual data points in settings where a model generates its own future training data, such as bandits, reinforcement learning, and language model post-training. Standard attribution ignores the fact that each observation changes the distribution from which later data arrives. The paper shows that replaying logged data is generally insufficient to recover the true attribution and isolates the structural conditions under which logged data does suffice.

Core claim

We formalize occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, prove that replay-side information cannot recover it in general, and identify a structural class in which the target is identified from logged data.

What carries the argument

The conditional interventional target, which isolates the effect of one specific occurrence while accounting for the adaptive feedback between learner updates and future data distributions.
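A minimal notational sketch of that target, assembled from the notation in the paper's appendix (Ψ, I^int, F, θ, and the occurrence perturbation w^(t,ε) all appear there); the exact expectation convention below is our reading, not a verbatim definition.

    % Sketch: conditional interventional target for the occurrence at time t.
    % w^{(t,\epsilon)} reweights that single occurrence by \epsilon, F is a
    % terminal functional of the final learner state \theta_{T+1}, and
    % h = z_{1:t} is the realized prefix (conditioning convention assumed).
    \Psi^{\epsilon}_{\nu,t}(h)
      = \mathbb{E}_{\nu}\!\left[ F(\theta_{T+1}) \,\middle|\, Z_{1:t} = h;\; w^{(t,\epsilon)} \right],
    \qquad
    I^{\mathrm{int}}_{t,\nu}(F;h)
      = \left.\frac{d}{d\epsilon}\, \Psi^{\epsilon}_{\nu,t}(h) \right|_{\epsilon=0}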

If this is right

  • Standard attribution methods applied to logged adaptive data will generally give incorrect importance scores for individual observations.
  • Replay buffers alone are not enough to perform reliable attribution in most adaptive learning pipelines.
  • In the identified structural class, logged data contains enough information to recover the interventional attribution target without additional interaction.
  • Attribution for online bandits, reinforcement learning, and adaptive language model training must explicitly model the feedback loop between updates and data collection; a toy sketch of this loop follows the list below.
  • Design of data collection policies should consider whether they preserve identifiability of occurrence-level attributions.
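A toy sketch of the feedback loop, in Python; the greedy two-armed bandit here is a hypothetical stand-in, not the paper's construction. Deleting a single early observation changes the learner's state and hence every later pull, so the counterfactual log is not the original log minus one row, which is exactly what static leave-one-out assumes.

    import numpy as np

    # Greedy 2-armed Bernoulli bandit (hypothetical illustration). A shared
    # noise table gives both runs the same environment randomness.
    rng = np.random.default_rng(0)
    T, means = 50, np.array([0.4, 0.6])
    noise = rng.random((T, 2))

    def run(drop_t=None):
        counts, wins = np.ones(2), np.ones(2)  # smoothed value estimates
        log = []
        for t in range(T):
            arm = int(np.argmax(wins / counts))  # greedy in current state
            if t == drop_t:
                continue  # delete this occurrence: the learner never sees it
            r = float(noise[t, arm] < means[arm])
            counts[arm] += 1
            wins[arm] += r
            log.append((t, arm, r))
        return log

    # The ablated run diverges beyond the deleted step: later arms differ,
    # so the log itself is counterfactually different.
    print(run()[:5])
    print(run(drop_t=0)[:5])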

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attribution tools used in post-training of large language models will need to incorporate explicit models of how the current model version shapes the distribution of future training examples.
  • The structural class identified in the paper could guide the creation of practical estimators that work on real logged data without requiring full re-simulation.
  • Extending the finite-horizon result to continuing or infinite-horizon adaptive processes would require new identification arguments that account for long-run distribution shifts.
  • Data attribution and causal effect estimation in adaptive systems are closely linked, suggesting that techniques from one area can directly inform the other.

Load-bearing premise

The conditional interventional target is the right formalization of occurrence-level attribution, and the identified structural class covers the adaptive learning problems that matter in practice.

What would settle it

A concrete adaptive bandit or RL environment outside the structural class where applying standard attribution to replayed logs produces importance scores that differ from the interventional target computed with full knowledge of the policy.
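A numeric rendering of exactly such an environment, following the horizon-2 construction in the paper's appendix: the learner state entering round 2 is 0 at baseline and ε under the occurrence perturbation, round 2 draws z₂ ~ Bernoulli(σ(γ·θ₂)), and the terminal target is z₂. The analytic values (baseline law 1/2 for every γ, interventional influence γ/4, replay response independent of γ) are the paper's; the code packaging is ours.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Horizon-2 environment nu_gamma (from the paper's appendix): the baseline
    # state theta_2 = 0 gives future law sigmoid(0) = 1/2 for every gamma,
    # so logged baseline data cannot distinguish the environments.

    def interventional_target(gamma, eps):
        # Psi^eps_{nu_gamma,1}(z1*) = sigmoid(gamma * eps)
        return sigmoid(gamma * eps)

    def interventional_influence(gamma, h=1e-6):
        # d/d eps at eps = 0; analytically gamma * sigmoid'(0) = gamma / 4
        return (interventional_target(gamma, h)
                - interventional_target(gamma, -h)) / (2.0 * h)

    def replay_response(eps, z2_logged):
        # Replaying the fixed log: the update U2(theta, z2, 1) = z2 overwrites
        # the state, so the terminal value is the logged z2 for every eps and
        # every gamma; replay-side influence carries no trace of gamma.
        return float(z2_logged)

    for gamma in (0.5, 1.0, 2.0):
        print(gamma,
              round(interventional_influence(gamma), 4),  # ~ gamma / 4
              replay_response(0.1, z2_logged=1))          # constant in gamma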

read the original abstract

Machine learning models increasingly generate their own training data -- online bandits, reinforcement learning, and post-training pipelines for language models are leading examples. In these adaptive settings, a single training observation both updates the learner and shifts the distribution of future data the learner will collect. Standard attribution methods, designed for static datasets, ignore this feedback. We formalize occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, prove that replay-side information cannot recover it in general, and identify a structural class in which the target is identified from logged data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper formalizes occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target. It proves that replay-side information cannot recover this target in general, and it identifies a structural class in which the target is identified from logged data. This addresses limitations of standard attribution methods in adaptive settings such as online bandits, reinforcement learning, and post-training for language models.

Significance. If the results hold, the paper makes a significant contribution by providing a rigorous framework for data attribution in adaptive learning, where feedback loops make standard methods inapplicable. The general non-identifiability result for replay-side information and the identification result for a structural class are valuable for both theoretical understanding and practical applications. The work is credited with introducing definitions from first principles and proving properties without circularity or reduction to fitted parameters.

minor comments (2)
  1. [Abstract] The abstract effectively summarizes the contributions but could benefit from a brief mention of the breadth of the identified structural class to better convey practical implications.
  2. Clarify the notation and definitions for the conditional interventional target early in the paper to aid readers from machine learning backgrounds less familiar with causal inference concepts.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The referee correctly summarizes our formalization of occurrence-level attribution via a conditional interventional target, the general non-identifiability result from replay data, and the identification result under a specific structural class. We appreciate the recognition of the contribution to adaptive learning settings such as bandits, RL, and language model post-training. Since the report lists no specific major comments, we provide no point-by-point responses below and stand ready to incorporate any minor revisions requested by the editor.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces new definitions for occurrence-level attribution via a conditional interventional target in finite-horizon adaptive learning, then proves a general non-identifiability result for replay-side information and identifies a structural class in which the target is recoverable from logged data. These elements are established through abstract formalization and first-principles proofs rather than any reduction to fitted parameters, self-referential equations, or load-bearing self-citations. The derivation chain remains self-contained, with independent mathematical content that does not collapse to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard causal and probabilistic concepts with one new invented formal object; no free parameters are evident from the abstract.

axioms (1)
  • standard math: Standard concepts from causal inference and probability for defining interventional distributions and conditional targets.
    Invoked to formalize the attribution target and prove identifiability properties.
invented entities (1)
  • conditional interventional target (no independent evidence)
    purpose: To define occurrence-level attribution accounting for adaptive feedback in data collection.
    New formal object introduced to capture the target quantity in adaptive settings.

pith-pipeline@v0.9.0 · 5368 in / 1217 out tokens · 75752 ms · 2026-05-10T19:46:50.443559+00:00 · methodology

discussion (0)

