On the Accuracy of Newton Step and Influence Function Data Attributions
Pith reviewed 2026-05-21 17:42 UTC · model grok-4.3
The pith
Newton step data attributions are asymptotically more accurate than influence functions for logistic regression models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a new analysis without global strong convexity yields asymptotically tight bounds for Newton step and influence function data attributions in logistic regression, specifically that the expected L2 error of the Newton step approximation is tilde-Theta of kd over n squared and the difference between Newton step and influence function is tilde-Theta of (k plus d) times square root of kd over n squared, for average-case removals of k samples.
What carries the argument
A new analysis technique for the differences in learned parameters after removing subsets of training data, which operates under milder well-behaved conditions on the logistic loss to achieve tight asymptotic scaling laws.
Load-bearing premise
The logistic regression must satisfy sufficiently well-behaved conditions milder than global strong convexity.
What would settle it
Measuring the average L2 norm differences in parameter estimates for logistic regression models trained on datasets with varying sizes n and removal counts k, and checking whether the observed errors follow the claimed scaling with n, k, and d.
Figures
read the original abstract
Data attribution aims to explain model predictions by estimating how they would change if certain training points were removed, and is used in a wide range of applications, from interpretability and credit assignment to unlearning and privacy. Even in the relatively simple case of logistic regressions, existing mathematical analyses of leading data attribution methods such as Influence Functions (IF) and single Newton Step (NS) remain limited in two key ways. First, they rely on global strong convexity assumptions which are often not satisfied in practice. Second, the resulting bounds scale very poorly with the number of parameters ($d$) and the number of samples removed ($k$). As a result, these analyses are not tight enough to answer fundamental questions such as "what is the asymptotic scaling of the errors of each method?" or "which of these methods is more accurate for a given dataset?" In this paper, we introduce a new analysis of the NS and IF data attribution methods for convex learning problems. To the best of our knowledge, this is the first analysis of these questions that does not assume global strong convexity and also the first explanation of [KATL19] and [RH25a]'s observation that NS data attribution is often more accurate than IF. We prove that for sufficiently well-behaved logistic regressions, our bounds are asymptotically tight up to poly-logarithmic factors, yielding scaling laws for the errors in the average-case sample removals. \[ \mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat{\theta}_T - \hat{\theta}_T^{\mathrm{NS}}\|_2 \bigr] = \widetilde{\Theta}\!\left(\frac{k d}{n^2}\right), \qquad \mathbb{E}_{T \subseteq [n],\, |T| = k} \bigl[ \|\hat{\theta}_T^{\mathrm{NS}} - \hat{\theta}_T^{\mathrm{IF}}\|_2 \bigr] = \widetilde{\Theta}\!\left( \frac{(k + d)\sqrt{k d}}{n^2} \right). \]
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a new analysis of Newton Step (NS) and Influence Function (IF) data attribution methods for convex learning problems, with a focus on logistic regression. It avoids global strong convexity assumptions and proves asymptotic tightness (up to poly-log factors) of the error bounds under 'sufficiently well-behaved' conditions, yielding explicit scaling laws for the average-case effect of removing k samples: E[||θ̂_T - θ̂_T^NS||_2] = ~Θ(kd/n²) and E[||θ̂_T^NS - θ̂_T^IF||_2] = ~Θ((k+d)√(kd)/n²).
Significance. If the results hold, the work is significant for providing the first analysis of NS and IF attributions that does not rely on global strong convexity and for deriving matching upper/lower bounds that explain why NS is often more accurate than IF in practice. The explicit scaling laws under milder local conditions advance understanding of data attribution accuracy in convex models.
major comments (2)
- [Abstract and main theorem] Abstract and main theorem (presumably §3 or §4): The central claim of asymptotic tightness for the stated scaling laws relies on 'sufficiently well-behaved' logistic regression conditions that enable the new local analysis. These conditions (e.g., any local eigenvalue bounds, smoothness parameters, or data-distribution assumptions) must be stated explicitly in the theorem, as their current vagueness prevents verification that they are strictly milder than global strong convexity or that the matching bounds up to polylog factors follow directly from them.
- [§5] §5 (or the section deriving the IF-NS difference): The bound E[||θ̂_T^NS - θ̂_T^IF||_2] = ~Θ((k+d)√(kd)/n²) is load-bearing for the claim that NS is more accurate; the derivation should include a direct comparison showing when this term is o( the NS error term kd/n² ) under the same well-behaved conditions.
minor comments (1)
- [Abstract] Notation for the expectation E_{T ⊆ [n], |T|=k} should be clarified to specify whether it is uniform over subsets or weighted by some distribution on removals.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of clarity in our theoretical claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and main theorem] Abstract and main theorem (presumably §3 or §4): The central claim of asymptotic tightness for the stated scaling laws relies on 'sufficiently well-behaved' logistic regression conditions that enable the new local analysis. These conditions (e.g., any local eigenvalue bounds, smoothness parameters, or data-distribution assumptions) must be stated explicitly in the theorem, as their current vagueness prevents verification that they are strictly milder than global strong convexity or that the matching bounds up to polylog factors follow directly from them.
Authors: We agree that the conditions require explicit statement for full rigor and to facilitate verification. In the revised manuscript, we will restate the main theorem with precise definitions of the 'sufficiently well-behaved' conditions, including local lower and upper bounds on the Hessian eigenvalues (in a neighborhood of the optimum) and local smoothness parameters on the loss. These local conditions are strictly milder than global strong convexity, as they permit the Hessian to vary or become singular far from the optimum. We will also include a brief derivation sketch showing how the matching upper and lower bounds (up to poly-log factors) follow directly from these assumptions via the local analysis in §3. revision: yes
-
Referee: [§5] §5 (or the section deriving the IF-NS difference): The bound E[||θ̂_T^NS - θ̂_T^IF||_2] = ~Θ((k+d)√(kd)/n²) is load-bearing for the claim that NS is more accurate; the derivation should include a direct comparison showing when this term is o( the NS error term kd/n² ) under the same well-behaved conditions.
Authors: We thank the referee for this suggestion, which clarifies the practical implications. Under the same well-behaved conditions, the NS-IF difference is asymptotically smaller than the NS error whenever k = o(d), since (k + d)√(kd) = o(kd) in that regime. We will add a direct comparison paragraph in §5 that derives this o(·) relation explicitly from the local eigenvalue bounds, and we will state the precise regime (e.g., k ≪ d) under which the difference vanishes relative to the leading NS term, thereby reinforcing why NS is typically more accurate. revision: yes
Circularity Check
No significant circularity; new analysis derives scaling laws from first-principles bounds on local properties.
full rationale
The paper presents a new analysis of Newton Step and Influence Function methods for convex problems that avoids global strong convexity assumptions. The asymptotic tightness claims and scaling laws E[||θ̂_T - θ̂_T^NS||_2] = ~Θ(kd/n²) and E[||θ̂_T^NS - θ̂_T^IF||_2] = ~Θ((k+d)√(kd)/n²) are obtained via direct bounding arguments under 'sufficiently well-behaved' local conditions rather than by fitting parameters to data or reducing to prior self-citations. Self-citations to [KATL19] and [RH25a] explain empirical observations but are not load-bearing for the new bounds, which are derived independently. No step equates a prediction to its input by construction or imports uniqueness via self-citation chains.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The learning problem is convex
- ad hoc to paper The model satisfies sufficiently well-behaved conditions allowing local analysis
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We place assumptions on the local strong convexity and local (higher-order) Lipschitzness of L only in a small neighborhood of the Newton step itself
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Lecture 5: Sub-exponential random variables and orlicz norms
[APAR19] scribe Aleksandr Podkopaev and lecturer Alessandro Rinaldo. Lecture 5: Sub-exponential random variables and orlicz norms. Scribed lecture notes 36-710 / 36-709, Spring 2019, Carnegie Mellon University, Department of Statistics & Data Science, Pittsburgh, PA, USA, February
work page 2019
-
[2]
11 [BGM20] Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample robustness metric: when can dropping a little data make a big difference?arXiv preprint arXiv:2011.14999,
-
[3]
[BHH+24] Gavin Brown, Jonathan Hayase, Samuel Hopkins, Weihao Kong, Xiyang Liu, Sewoong Oh, Juan C Perdomo, and Adam Smith. Insufficient statistics perturbation: Stable estimators for private least squares.arXiv preprint arXiv:2404.15409,
-
[4]
[BNL+22] Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and Roger B. Grosse. If influence functions are the answer, then what is the question? In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur...
work page 2022
-
[5]
Influence functions in deep learning are fragile
[BPF21] Samyadeep Basu, Phillip Pope, and Soheil Feizi. Influence functions in deep learning are fragile. In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,
work page 2021
-
[6]
[CAB+24] Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, et al. What is your data worth to gpt? llm-scale data valuation with influence functions.arXiv preprint arXiv:2405.13954,
-
[7]
Optimizing ml training with metagradient descent.arXiv preprint arXiv:2503.13751,
[EIC+25] Logan Engstrom, Andrew Ilyas, Benjamin Chen, Axel Feldmann, William Moses, and Aleksander Madry. Optimizing ml training with metagradient descent.arXiv preprint arXiv:2503.13751,
-
[8]
[GBA+23] Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296,
-
[9]
Fastif: Scalable influence functions for efficient model interpretation and debugging
[GRH+21] Han Guo, Nazneen Rajani, Peter Hase, Mohit Bansal, and Caiming Xiong. Fastif: Scalable influence functions for efficient model interpretation and debugging. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10333–10350,
work page 2021
-
[10]
[HBN+24] Jenny Y Huang, David R Burt, Tin D Nguyen, Yunyi Shen, and Tamara Broderick. Approxima- tions to worst-case data dropping: unmasking failure modes.arXiv preprint arXiv:2408.09008,
-
[11]
Datamodels: Understanding predictions with data and data with predictions
[IPE+22] Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Understanding predictions with data and data with predictions. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimor...
work page 2022
-
[12]
A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm
[JNG+19] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm.arXiv preprint arXiv:1902.03736,
work page internal anchor Pith review Pith/arXiv arXiv 1902
-
[13]
[KATL19] Pang Wei Koh, Kai-Siang Ang, Hubert H. K. Teo, and Percy Liang. On the accuracy of influence functions for measuring group effects. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Pr...
work page 2019
-
[14]
Understanding black-box predictions via influence functions
[KL17] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Doina Precup and Yee Whye Teh, editors,Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1885–1894. PMLR,
work page 2017
-
[15]
[LW24] Omri Lev and Ashia C Wilson. The approximate fisher influence function: Faster estimation of data influence in statistical models.arXiv preprint arXiv:2407.08169,
-
[16]
Influence functions for scalable data attribution in diffusion models
[MEB+] Bruno Kacper Mlodozeniec, Runa Eschenhagen, Juhan Bae, Alexander Immer, David Krueger, and Richard E Turner. Influence functions for scalable data attribution in diffusion models. In The Thirteenth International Conference on Learning Representations. 13 [MIE+24] Aleksander Madry, Andrew Ilyas, Logan Engstrom, Sung Min (Sam) Park, and Kristian Geor...
work page 2024
-
[17]
Tutorial presented at the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, July 22,
work page 2024
-
[18]
Trak: Attributing model behavior at scale.arXiv preprint arXiv:2303.14186,
[PGI+23] Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. Trak: Attributing model behavior at scale.arXiv preprint arXiv:2303.14186,
-
[19]
[RH25a] Ittai Rubinstein and Samuel B Hopkins. Rescaled influence functions: Accurate data attribution in high dimension.arXiv preprint arXiv:2506.06656,
-
[20]
[RH25b] Ittai Rubinstein and Samuel B. Hopkins. Robustness auditing for linear regression: To sin- gularity and beyond. InProceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025),
work page 2025
-
[21]
[RM18] Kamiar Rahnama Rad and Arian Maleki. A scalable estimate of the extra-sample prediction error via approximate leave-one-out.arXiv preprint arXiv:1801.10243,
-
[22]
Remember what you want to forget: Algorithms for machine unlearning
[SAKS21b] Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors,Advances in Neural Information Processing Systems 34: Annual Conference on Neural Informati...
work page 2021
-
[23]
[SW22] Vinith M. Suriyakumar and Ashia C. Wilson. Algorithms that approximate data removal: New results and limitations. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA,...
work page 2022
-
[24]
Introduction to the non-asymptotic analysis of random matrices
[Ver10] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices.arXiv preprint arXiv:1011.3027,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
[ZAK+25] Haolin Zou, Arnab Auddy, Yongchan Kwon, Kamiar Rahnama Rad, and Arian Maleki. Newflu- ence: Boosting model interpretability and understanding in high dimensions.arXiv preprint arXiv:2507.11895,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.