pith. machine review for the scientific record.

arxiv: 2604.04892 · v1 · submitted 2026-04-06 · 💻 cs.LG

Recognition: no theorem link

Data Attribution in Adaptive Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 19:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords data attribution · adaptive learning · reinforcement learning · online learning · bandits · interventional inference · causal attribution · logged data

The pith

Occurrence-level attribution in adaptive learning requires a conditional interventional target that replay data cannot recover except in specific structural classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a principled way to attribute individual data points in settings where a model generates its own future training data, such as bandits, reinforcement learning, and language model post-training. Standard attribution ignores the fact that each observation changes the distribution from which later data arrives. The paper shows that replaying logged data is generally insufficient to recover the true attribution and isolates the structural conditions under which logged data does suffice.

Core claim

We formalize occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, prove that replay-side information cannot recover it in general, and identify a structural class in which the target is identified from logged data.

What carries the argument

The conditional interventional target, which isolates the effect of one specific occurrence while accounting for the adaptive feedback between learner updates and future data distributions.
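A minimal notational sketch of that target, assembled from the notation in the paper's appendix (Ψ, I^int, F, θ, and the occurrence perturbation w^(t,ε) all appear there); the exact expectation convention below is our reading, not a verbatim definition.

    % Sketch: conditional interventional target for the occurrence at time t.
    % w^{(t,\epsilon)} reweights that single occurrence by \epsilon, F is a
    % terminal functional of the final learner state \theta_{T+1}, and
    % h = z_{1:t} is the realized prefix (conditioning convention assumed).
    \Psi^{\epsilon}_{\nu,t}(h)
      = \mathbb{E}_{\nu}\!\left[ F(\theta_{T+1}) \,\middle|\, Z_{1:t} = h;\; w^{(t,\epsilon)} \right],
    \qquad
    I^{\mathrm{int}}_{t,\nu}(F;h)
      = \left.\frac{d}{d\epsilon}\, \Psi^{\epsilon}_{\nu,t}(h) \right|_{\epsilon=0}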

If this is right

  • Standard attribution methods applied to logged adaptive data will generally give incorrect importance scores for individual observations.
  • Replay buffers alone are not enough to perform reliable attribution in most adaptive learning pipelines.
  • In the identified structural class, logged data contains enough information to recover the interventional attribution target without additional interaction.
  • Attribution for online bandits, reinforcement learning, and adaptive language model training must explicitly model the feedback loop between updates and data collection; a toy sketch of this loop follows the list below.
  • Design of data collection policies should consider whether they preserve identifiability of occurrence-level attributions.
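A toy sketch of the feedback loop, in Python; the greedy two-armed bandit here is a hypothetical stand-in, not the paper's construction. Deleting a single early observation changes the learner's state and hence every later pull, so the counterfactual log is not the original log minus one row, which is exactly what static leave-one-out assumes.

    import numpy as np

    # Greedy 2-armed Bernoulli bandit (hypothetical illustration). A shared
    # noise table gives both runs the same environment randomness.
    rng = np.random.default_rng(0)
    T, means = 50, np.array([0.4, 0.6])
    noise = rng.random((T, 2))

    def run(drop_t=None):
        counts, wins = np.ones(2), np.ones(2)  # smoothed value estimates
        log = []
        for t in range(T):
            arm = int(np.argmax(wins / counts))  # greedy in current state
            if t == drop_t:
                continue  # delete this occurrence: the learner never sees it
            r = float(noise[t, arm] < means[arm])
            counts[arm] += 1
            wins[arm] += r
            log.append((t, arm, r))
        return log

    # The ablated run diverges beyond the deleted step: later arms differ,
    # so the log itself is counterfactually different.
    print(run()[:5])
    print(run(drop_t=0)[:5])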

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attribution tools used in post-training of large language models will need to incorporate explicit models of how the current model version shapes the distribution of future training examples.
  • The structural class identified in the paper could guide the creation of practical estimators that work on real logged data without requiring full re-simulation.
  • Extending the finite-horizon result to continuing or infinite-horizon adaptive processes would require new identification arguments that account for long-run distribution shifts.
  • Data attribution and causal effect estimation in adaptive systems are closely linked, suggesting that techniques from one area can directly inform the other.

Load-bearing premise

The conditional interventional target is the right formalization of occurrence-level attribution, and the identified structural class covers the adaptive learning problems that matter in practice.

What would settle it

A concrete adaptive bandit or RL environment outside the structural class where applying standard attribution to replayed logs produces importance scores that differ from the interventional target computed with full knowledge of the policy.
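A numeric rendering of exactly such an environment, following the horizon-2 construction in the paper's appendix: the learner state entering round 2 is 0 at baseline and ε under the occurrence perturbation, round 2 draws z₂ ~ Bernoulli(σ(γ·θ₂)), and the terminal target is z₂. The analytic values (baseline law 1/2 for every γ, interventional influence γ/4, replay response independent of γ) are the paper's; the code packaging is ours.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Horizon-2 environment nu_gamma (from the paper's appendix): the baseline
    # state theta_2 = 0 gives future law sigmoid(0) = 1/2 for every gamma,
    # so logged baseline data cannot distinguish the environments.

    def interventional_target(gamma, eps):
        # Psi^eps_{nu_gamma,1}(z1*) = sigmoid(gamma * eps)
        return sigmoid(gamma * eps)

    def interventional_influence(gamma, h=1e-6):
        # d/d eps at eps = 0; analytically gamma * sigmoid'(0) = gamma / 4
        return (interventional_target(gamma, h)
                - interventional_target(gamma, -h)) / (2.0 * h)

    def replay_response(eps, z2_logged):
        # Replaying the fixed log: the update U2(theta, z2, 1) = z2 overwrites
        # the state, so the terminal value is the logged z2 for every eps and
        # every gamma; replay-side influence carries no trace of gamma.
        return float(z2_logged)

    for gamma in (0.5, 1.0, 2.0):
        print(gamma,
              round(interventional_influence(gamma), 4),  # ~ gamma / 4
              replay_response(0.1, z2_logged=1))          # constant in gamma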

read the original abstract

Machine learning models increasingly generate their own training data -- online bandits, reinforcement learning, and post-training pipelines for language models are leading examples. In these adaptive settings, a single training observation both updates the learner and shifts the distribution of future data the learner will collect. Standard attribution methods, designed for static datasets, ignore this feedback. We formalize occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, prove that replay-side information cannot recover it in general, and identify a structural class in which the target is identified from logged data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper formalizes occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target. It proves that replay-side information cannot recover this target in general, and it identifies a structural class in which the target is identified from logged data. This addresses limitations of standard attribution methods in adaptive settings such as online bandits, reinforcement learning, and post-training for language models.

Significance. If the results hold, the paper makes a significant contribution by providing a rigorous framework for data attribution in adaptive learning, where feedback loops make standard methods inapplicable. The general non-identifiability result for replay-side information and the identification result for a structural class are valuable for both theoretical understanding and practical applications. The work is credited with introducing definitions from first principles and proving properties without circularity or reduction to fitted parameters.

minor comments (2)
  1. [Abstract] The abstract effectively summarizes the contributions but could benefit from a brief mention of the breadth of the identified structural class to better convey practical implications.
  2. Clarify the notation and definitions for the conditional interventional target early in the paper to aid readers from machine learning backgrounds less familiar with causal inference concepts.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The referee correctly summarizes our formalization of occurrence-level attribution via a conditional interventional target, the general non-identifiability result from replay data, and the identification result under a specific structural class. We appreciate the recognition of the contribution to adaptive learning settings such as bandits, RL, and language model post-training. Since the report lists no specific major comments, we provide no point-by-point responses below and stand ready to incorporate any minor revisions requested by the editor.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces new definitions for occurrence-level attribution via a conditional interventional target in finite-horizon adaptive learning, then proves a general non-identifiability result for replay-side information and identifies a structural class in which the target is recoverable from logged data. These elements are established through abstract formalization and first-principles proofs rather than any reduction to fitted parameters, self-referential equations, or load-bearing self-citations. The derivation chain remains self-contained, with independent mathematical content that does not collapse to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard causal and probabilistic concepts with one new invented formal object; no free parameters are evident from the abstract.

axioms (1)
  • standard math: Standard concepts from causal inference and probability for defining interventional distributions and conditional targets.
    Invoked to formalize the attribution target and prove identifiability properties.
invented entities (1)
  • conditional interventional target (no independent evidence)
    purpose: To define occurrence-level attribution accounting for adaptive feedback in data collection.
    New formal object introduced to capture the target quantity in adaptive settings.

pith-pipeline@v0.9.0 · 5368 in / 1217 out tokens · 75752 ms · 2026-05-10T19:46:50.443559+00:00 · methodology

discussion (0)

