Recognition: no theorem link
Extending Kernel Trick to Influence Functions
Pith reviewed 2026-05-13 02:03 UTC · model grok-4.3
The pith
For linearizable models, a dual representation of influence functions computes their effects at a cost that scales with dataset size rather than model parameter count.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a dual representation of the influence functions, whose computational complexity scales with dataset size rather than model size. Both analytically and experimentally, we show that this representation can be an efficient alternative to the original influence functions for estimating changes in parameters, model outputs and loss due to data point removal, when model size is large relative to dataset size, or when evaluating the original influence functions in parameter space is infeasible. The dual representation, however, is limited to linearizable models, which are models whose behavior can be approximated by their linearizations throughout training, and requires materializing a matrix, whose size grows with the product of model output dimension and dataset size.
What carries the argument
The dual representation of influence functions, obtained by applying the kernel trick, rewrites the required quantities in terms of a data-dependent kernel matrix instead of operations over model parameters.
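A plausible reconstruction of that step, assuming a linearized model with squared loss and an $\ell_2$ regularizer of strength $\lambda$ (the paper's exact notation may differ): stacking the per-example Jacobians into $J \in \mathbb{R}^{nc \times p}$, with $n$ the dataset size, $c$ the output dimension, and $p$ the parameter count, the push-through (Woodbury-type) identity

\[
(J^{\top} J + \lambda I_{p})^{-1} J^{\top} \;=\; J^{\top} \, (J J^{\top} + \lambda I_{nc})^{-1}
\]

moves the matrix inverse from the $p \times p$ parameter-space Hessian onto the $nc \times nc$ kernel (Gram) matrix $K = J J^{\top}$, whose entries are pairwise Jacobian inner products. This is why the cost scales with dataset size times output dimension rather than parameter count, and also why the matrix flagged in the abstract's caveat must be materialized.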
If this is right
- Changes in model parameters after data point removal can be estimated without forming or inverting large matrices in parameter space (a worked sketch follows this list).
- Changes in model outputs and loss values after data removal follow directly from the same dual quantities.
- The method becomes preferable whenever the number of model parameters greatly exceeds the number of training examples.
- It supplies a practical workaround when the original influence function computation is numerically unstable or memory-prohibitive in parameter space.
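A minimal sketch of the first bullet in Python, assuming ridge-regularized squared loss on a linear (or linearized) model; all names, sizes, and the removal convention are illustrative, not the paper's code. It computes the same parameter-change estimate twice: once with a p x p primal solve and once with an n x n dual solve.

```python
# Hypothetical illustration: primal vs. dual influence estimate for removing
# one training point from a ridge-regularized linear model (output dim 1).
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 2000, 1e-1               # p >> n: the dual form should win
J = rng.normal(size=(n, p))              # rows are per-example Jacobians
y = rng.normal(size=n)

# Fit in the dual: theta = J^T (J J^T + lam I)^{-1} y.
K = J @ J.T + lam * np.eye(n)
theta = J.T @ np.linalg.solve(K, y)

m = 7                                    # index of the removed point
r = J[m] @ theta - y[m]                  # its residual at the fit
g = r * J[m]                             # gradient of the removed point's loss

# Primal influence step: solve against the p x p Hessian H = J^T J + lam I.
H = J.T @ J + lam * np.eye(p)
dtheta_primal = np.linalg.solve(H, g)

# Dual influence step: H^{-1} J^T e = J^T K^{-1} e (push-through identity),
# so only the n x n kernel matrix is ever factorized.
e = np.zeros(n)
e[m] = r
dtheta_dual = J.T @ np.linalg.solve(K, e)

print(np.max(np.abs(dtheta_primal - dtheta_dual)))  # agrees to numerical precision
```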
Where Pith is reading between the lines
- The approach could be tested on transformer-based models by first verifying how closely their training trajectories stay near linear approximations.
- Similar dual rewritings might apply to other data-attribution techniques that currently operate in parameter space.
- When output dimension is high, the required intermediate matrix may still become costly, suggesting a need for further low-rank or sparse approximations (a generic sketch follows this list).
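For the last point, one generic mitigation worth noting (an assumption here, not something the paper proposes) is a randomized low-rank factorization of the kernel matrix, in the style of the Halko-Martinsson-Tropp range finder:

```python
# Hypothetical sketch: compress a large PSD kernel matrix K = J J^T when a high
# output dimension inflates its side length nc. Generic technique, not the paper's.
import numpy as np

def lowrank_psd(K, rank, rng):
    """Approximate PSD K as Q (Q^T K Q) Q^T with orthonormal Q of shape (nc, rank)."""
    omega = rng.normal(size=(K.shape[0], rank))   # random test matrix
    Q, _ = np.linalg.qr(K @ omega)                # orthonormal basis for range(K)
    return Q, Q.T @ K @ Q

rng = np.random.default_rng(0)
J = rng.normal(size=(300, 50))                    # nc = 300, intrinsic rank 50
K = J @ J.T
Q, core = lowrank_psd(K, rank=60, rng=rng)
err = np.linalg.norm(K - Q @ core @ Q.T) / np.linalg.norm(K)
print(f"relative error: {err:.2e}")               # tiny when rank >= rank(K)
```

Solves against the regularized kernel can then be routed through the small core matrix instead of the full nc x nc system.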
Load-bearing premise
The model must behave approximately like its linearization around the trained parameters throughout the training process.
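A crude numerical probe of this premise, sketched with JAX (the model, update scale, and metric are illustrative assumptions): compare the network's actual output change under a training-sized parameter update with the first-order (tangent) prediction.

```python
# Hypothetical linearity-error metric: how far does the model stray from its
# tangent (linearized) approximation under a parameter update of training scale?
import jax
import jax.numpy as jnp

def f(theta, x):
    # Tiny two-layer MLP with a flat parameter vector, purely for illustration.
    w1 = theta[:64].reshape(8, 8)
    w2 = theta[64:72]
    return jnp.tanh(x @ w1) @ w2

theta0 = 0.5 * jax.random.normal(jax.random.PRNGKey(0), (72,))
x = jax.random.normal(jax.random.PRNGKey(1), (16, 8))
# Stand-in for the parameter change accumulated over training.
delta = 0.01 * jax.random.normal(jax.random.PRNGKey(2), (72,))

f0 = f(theta0, x)
f1 = f(theta0 + delta, x)
# jax.jvp returns (f(theta0), J(theta0) @ delta) in a single forward pass.
_, jd = jax.jvp(lambda t: f(t, x), (theta0,), (delta,))
rel_err = jnp.linalg.norm(f1 - (f0 + jd)) / jnp.linalg.norm(f1 - f0)
print(f"relative linearization error: {rel_err:.3e}")  # near 0 => premise plausible
```

Repeating this along actual training checkpoints, rather than for a random perturbation, would be the more faithful version of the check.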
What would settle it
Retraining a small linearizable model after removing specific data points and checking whether the dual estimates match the observed changes in parameters, outputs, and loss within the expected linearization error (a minimal version of this check follows).
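A minimal version of that test, assuming an exactly linear model (ridge regression) so retraining has a closed form; the (1 - h) factor below is the standard leave-one-out correction for quadratic objectives, which a plain influence-function estimate drops. Sizes and names are illustrative.

```python
# Hypothetical falsification check: dual removal estimate vs. exact retraining.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 40, 500, 1.0
J = rng.normal(size=(n, p))
y = rng.normal(size=n)

def ridge_fit(J, y):
    # Dual-form ridge solution: theta = J^T (J J^T + lam I)^{-1} y.
    return J.T @ np.linalg.solve(J @ J.T + lam * np.eye(len(y)), y)

theta = ridge_fit(J, y)
m = 3
theta_retrained = ridge_fit(np.delete(J, m, axis=0), np.delete(y, m))

# Dual removal estimate: v = J^T K^{-1} e_m equals H^{-1} J_m^T by push-through.
K = J @ J.T + lam * np.eye(n)
e = np.zeros(n)
e[m] = 1.0
v = J.T @ np.linalg.solve(K, e)
r = J[m] @ theta - y[m]                   # residual of the removed point
h = J[m] @ v                              # its leverage
theta_est = theta + (r / (1.0 - h)) * v   # exact leave-one-out Newton step

print(np.max(np.abs(theta_est - theta_retrained)))  # ~0: exact for a linear model
```

For a genuinely nonlinear but linearizable network, the analogous gap should stay within the linearization error measured above.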
Original abstract
In this paper, we present a dual representation of the influence functions, whose computational complexity scales with dataset size rather than model size. Both analytically and experimentally, we show that this representation can be an efficient alternative to the original influence functions for estimating changes in parameters, model outputs and loss due to data point removal, when model size is large relative to dataset size, or when evaluating the original influence functions in parameter space is infeasible. The dual representation, however, is limited to linearizable models, which are models whose behavior can be approximated by their linearizations throughout training, and requires materializing a matrix, whose size grows with the product of model output dimension and dataset size.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a dual representation of influence functions that extends the kernel trick, allowing computational complexity to scale with dataset size rather than model parameter count. It claims both analytic equivalence to the standard (primal) influence functions under the linearizable-model assumption and experimental validation showing that the dual form efficiently approximates the effects of removing a data point on parameters, model outputs, and loss. The approach is explicitly scoped to linearizable models (those approximable by their linearizations throughout training) and requires materializing a matrix whose size grows with the product of output dimension and dataset size.
Significance. If the claimed analytic equivalence holds without hidden restrictions and the experiments confirm practical gains, the work would meaningfully extend the reach of influence functions to large-scale models where parameter-space computation is prohibitive. It offers a concrete efficiency trade-off (data size vs. model size) that could support broader use in data influence analysis, model debugging, and interpretability for deep networks under the stated assumptions. The explicit scoping to linearizable models and matrix-size caveat is a strength that prevents overclaiming.
Minor comments (3)
- [Abstract] Abstract and introduction: the phrase 'linearizable models' is used without an immediate forward reference to its formal definition (likely in §2 or §3); adding a one-sentence characterization or equation pointer would improve readability for readers scanning the scope.
- [Derivation section] Derivation of the dual form: while the abstract asserts analytic equivalence, the steps linking the standard influence-function Hessian-inverse expression to the dual (kernel) representation should be cross-referenced to the exact equations (e.g., the linearization assumption and the Woodbury or kernel-matrix identity used).
- [Experiments] Experimental section: the reported speed-ups and accuracy comparisons would benefit from explicit statements of the output dimension, dataset sizes, and model architectures tested, plus a brief check that the linearizable assumption held for the chosen networks (e.g., via a linearity-error metric).
Simulated Author's Rebuttal
We thank the referee for their positive summary of our work and for recommending minor revision. The referee accurately captures the paper's core contribution (a dual representation of influence functions whose complexity scales with dataset size) along with its explicit scope to linearizable models and the associated matrix-size requirement. No specific major comments were raised in the report.
Circularity Check
No significant circularity
Full rationale
The derivation obtains the dual representation by extending the kernel trick under the standard linearization assumptions for linearizable models. This yields an analytical equivalence for estimating parameter, output, and loss changes due to data removal, with complexity scaling with dataset size when the model is larger. The claim is explicitly scoped to those models and to the materialization of a matrix whose size grows with the product of output dimension and dataset size; both the analytical step and the experimental validation are independent of any fitted parameter or self-referential definition. No load-bearing self-citation, self-definitional step, or prediction-by-construction is present.
Reference graph
Works this paper leans on
-
[1]
Understanding black-box predictions via influence functions
Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning (ICML), 2017
work page 2017
-
[2]
A survey of machine unlearning
Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning. ACM Transactions on Intelligent Systems and Technology, 2025
work page 2025
-
[3]
Certified data removal from machine learning models
Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens van der Maaten. Certified data removal from machine learning models. In International Conference on Machine Learning (ICML), 2020
work page 2020
-
[4]
Neural tangent kernel: Convergence and generalization in neural networks
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018
work page 2018
-
[5]
Wide neural networks of any depth evolve as linear models under gradient descent
Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), 2019
work page 2019
-
[6]
JAX: composable transformations of Python+NumPy programs
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax
work page 2018
-
[7]
Neural tangents: Fast and easy infinite neural networks in Python
Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in Python. In International Conference on Learning Representations (ICLR), 2020. URL https://github.com/google/neural-tangents
work page 2020
-
[8]
Fast finite width neural tangent kernel
Roman Novak, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Fast finite width neural tangent kernel. In International Conference on Machine Learning (ICML), 2022
work page 2022
-
[9]
Infinite attention: NNGP and NTK for deep attention networks
Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak. Infinite attention: NNGP and NTK for deep attention networks. In International Conference on Machine Learning (ICML), 2020
work page 2020
-
[10]
On the infinite width limit of neural networks with a standard parameterization
Jascha Sohl-Dickstein, Roman Novak, Samuel S Schoenholz, and Jaehoon Lee. On the infinite width limit of neural networks with a standard parameterization. arXiv preprint, 2020
work page 2020
-
[11]
Fast neural kernel embeddings for general activations
Insu Han, Amir Zandieh, Jaehoon Lee, Roman Novak, Lechao Xiao, and Amin Karbasi. Fast neural kernel embeddings for general activations. In Advances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[12]
Gradient-based learning applied to document recognition
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 2002
work page 2002
-
[13]
Learning multiple layers of features from tiny images
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009
work page 2009
-
[14]
Influence functions in deep learning are fragile
S Basu, P Pope, and S Feizi. Influence functions in deep learning are fragile. In International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[15]
Scaling up influence functions
Andrea Schioppa, Polina Zablotskaia, David Vilar, and Artem Sokolov. Scaling up influence functions. In AAAI Conference on Artificial Intelligence, 2022
work page 2022
-
[16]
Studying large language model generalization with influence functions
Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions. arXiv preprint, 2023
work page 2023
-
[17]
Machine unlearning of features and labels
Alexander Warnecke, Lukas Pirch, Christian Wressnegger, and Konrad Rieck. Machine unlearning of features and labels. In Network and Distributed System Security (NDSS) Symposium, 2023
work page 2023
-
[18]
Gradient descent finds global minima of deep neural networks
Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), 2019
work page 2019
-
[19]
A convergence theory for deep learning via over-parameterization
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), 2019
work page 2019