Recognition: no theorem link
Extending Kernel Trick to Influence Functions
Pith reviewed 2026-05-13 02:03 UTC · model grok-4.3
The pith
For linearizable models, a dual representation of influence functions computes their effects at a cost that scales with dataset size rather than model parameter count.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a dual representation of the influence functions, whose computational complexity scales with dataset size rather than model size. Both analytically and experimentally, we show that this representation can be an efficient alternative to the original influence functions for estimating changes in parameters, model outputs and loss due to data point removal, when model size is large relative to dataset size, or when evaluating the original influence functions in parameter space is infeasible. The dual representation, however, is limited to linearizable models, which are models whose behavior can be approximated by their linearizations throughout training, and requires materializing a matrix, whose size grows with the product of model output dimension and dataset size.
What carries the argument
The dual representation of influence functions, obtained by applying the kernel trick, rewrites the required quantities in terms of a data-dependent kernel matrix instead of operations over model parameters.
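A plausible reconstruction of that step, assuming a linearized model with squared loss and an $\ell_2$ regularizer of strength $\lambda$ (the paper's exact notation may differ): stacking the per-example Jacobians into $J \in \mathbb{R}^{nc \times p}$, with $n$ the dataset size, $c$ the output dimension, and $p$ the parameter count, the push-through (Woodbury-type) identity

\[
(J^{\top} J + \lambda I_{p})^{-1} J^{\top} \;=\; J^{\top} \, (J J^{\top} + \lambda I_{nc})^{-1}
\]

moves the matrix inverse from the $p \times p$ parameter-space Hessian onto the $nc \times nc$ kernel (Gram) matrix $K = J J^{\top}$, whose entries are pairwise Jacobian inner products. This is why the cost scales with dataset size times output dimension rather than parameter count, and also why the matrix flagged in the abstract's caveat must be materialized.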
If this is right
- Changes in model parameters after data point removal can be estimated without forming or inverting large matrices in parameter space (a worked sketch follows this list).
- Changes in model outputs and loss values after data removal follow directly from the same dual quantities.
- The method becomes preferable whenever the number of model parameters greatly exceeds the number of training examples.
- It supplies a practical workaround when the original influence function computation is numerically unstable or memory-prohibitive in parameter space.
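A minimal sketch of the first bullet in Python, assuming ridge-regularized squared loss on a linear (or linearized) model; all names, sizes, and the removal convention are illustrative, not the paper's code. It computes the same parameter-change estimate twice: once with a p x p primal solve and once with an n x n dual solve.

```python
# Hypothetical illustration: primal vs. dual influence estimate for removing
# one training point from a ridge-regularized linear model (output dim 1).
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 2000, 1e-1               # p >> n: the dual form should win
J = rng.normal(size=(n, p))              # rows are per-example Jacobians
y = rng.normal(size=n)

# Fit in the dual: theta = J^T (J J^T + lam I)^{-1} y.
K = J @ J.T + lam * np.eye(n)
theta = J.T @ np.linalg.solve(K, y)

m = 7                                    # index of the removed point
r = J[m] @ theta - y[m]                  # its residual at the fit
g = r * J[m]                             # gradient of the removed point's loss

# Primal influence step: solve against the p x p Hessian H = J^T J + lam I.
H = J.T @ J + lam * np.eye(p)
dtheta_primal = np.linalg.solve(H, g)

# Dual influence step: H^{-1} J^T e = J^T K^{-1} e (push-through identity),
# so only the n x n kernel matrix is ever factorized.
e = np.zeros(n)
e[m] = r
dtheta_dual = J.T @ np.linalg.solve(K, e)

print(np.max(np.abs(dtheta_primal - dtheta_dual)))  # agrees to numerical precision
```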
Where Pith is reading between the lines
- The approach could be tested on transformer-based models by first verifying how closely their training trajectories stay near linear approximations.
- Similar dual rewritings might apply to other data-attribution techniques that currently operate in parameter space.
- When output dimension is high, the required intermediate matrix may still become costly, suggesting a need for further low-rank or sparse approximations (a generic sketch follows this list).
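For the last point, one generic mitigation worth noting (an assumption here, not something the paper proposes) is a randomized low-rank factorization of the kernel matrix, in the style of the Halko-Martinsson-Tropp range finder:

```python
# Hypothetical sketch: compress a large PSD kernel matrix K = J J^T when a high
# output dimension inflates its side length nc. Generic technique, not the paper's.
import numpy as np

def lowrank_psd(K, rank, rng):
    """Approximate PSD K as Q (Q^T K Q) Q^T with orthonormal Q of shape (nc, rank)."""
    omega = rng.normal(size=(K.shape[0], rank))   # random test matrix
    Q, _ = np.linalg.qr(K @ omega)                # orthonormal basis for range(K)
    return Q, Q.T @ K @ Q

rng = np.random.default_rng(0)
J = rng.normal(size=(300, 50))                    # nc = 300, intrinsic rank 50
K = J @ J.T
Q, core = lowrank_psd(K, rank=60, rng=rng)
err = np.linalg.norm(K - Q @ core @ Q.T) / np.linalg.norm(K)
print(f"relative error: {err:.2e}")               # tiny when rank >= rank(K)
```

Solves against the regularized kernel can then be routed through the small core matrix instead of the full nc x nc system.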
Load-bearing premise
The model must behave approximately like its linearization around the trained parameters throughout the training process.
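A crude numerical probe of this premise, sketched with JAX (the model, update scale, and metric are illustrative assumptions): compare the network's actual output change under a training-sized parameter update with the first-order (tangent) prediction.

```python
# Hypothetical linearity-error metric: how far does the model stray from its
# tangent (linearized) approximation under a parameter update of training scale?
import jax
import jax.numpy as jnp

def f(theta, x):
    # Tiny two-layer MLP with a flat parameter vector, purely for illustration.
    w1 = theta[:64].reshape(8, 8)
    w2 = theta[64:72]
    return jnp.tanh(x @ w1) @ w2

theta0 = 0.5 * jax.random.normal(jax.random.PRNGKey(0), (72,))
x = jax.random.normal(jax.random.PRNGKey(1), (16, 8))
# Stand-in for the parameter change accumulated over training.
delta = 0.01 * jax.random.normal(jax.random.PRNGKey(2), (72,))

f0 = f(theta0, x)
f1 = f(theta0 + delta, x)
# jax.jvp returns (f(theta0), J(theta0) @ delta) in a single forward pass.
_, jd = jax.jvp(lambda t: f(t, x), (theta0,), (delta,))
rel_err = jnp.linalg.norm(f1 - (f0 + jd)) / jnp.linalg.norm(f1 - f0)
print(f"relative linearization error: {rel_err:.3e}")  # near 0 => premise plausible
```

Repeating this along actual training checkpoints, rather than for a random perturbation, would be the more faithful version of the check.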
What would settle it
Retraining a small linearizable model after removing specific data points and checking whether the dual estimates match the observed changes in parameters, outputs, and loss within the expected linearization error (a minimal version of this check follows).
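A minimal version of that test, assuming an exactly linear model (ridge regression) so retraining has a closed form; the (1 - h) factor below is the standard leave-one-out correction for quadratic objectives, which a plain influence-function estimate drops. Sizes and names are illustrative.

```python
# Hypothetical falsification check: dual removal estimate vs. exact retraining.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 40, 500, 1.0
J = rng.normal(size=(n, p))
y = rng.normal(size=n)

def ridge_fit(J, y):
    # Dual-form ridge solution: theta = J^T (J J^T + lam I)^{-1} y.
    return J.T @ np.linalg.solve(J @ J.T + lam * np.eye(len(y)), y)

theta = ridge_fit(J, y)
m = 3
theta_retrained = ridge_fit(np.delete(J, m, axis=0), np.delete(y, m))

# Dual removal estimate: v = J^T K^{-1} e_m equals H^{-1} J_m^T by push-through.
K = J @ J.T + lam * np.eye(n)
e = np.zeros(n)
e[m] = 1.0
v = J.T @ np.linalg.solve(K, e)
r = J[m] @ theta - y[m]                   # residual of the removed point
h = J[m] @ v                              # its leverage
theta_est = theta + (r / (1.0 - h)) * v   # exact leave-one-out Newton step

print(np.max(np.abs(theta_est - theta_retrained)))  # ~0: exact for a linear model
```

For a genuinely nonlinear but linearizable network, the analogous gap should stay within the linearization error measured above.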
Original abstract
In this paper, we present a dual representation of the influence functions, whose computational complexity scales with dataset size rather than model size. Both analytically and experimentally, we show that this representation can be an efficient alternative to the original influence functions for estimating changes in parameters, model outputs and loss due to data point removal, when model size is large relative to dataset size, or when evaluating the original influence functions in parameter space is infeasible. The dual representation, however, is limited to linearizable models, which are models whose behavior can be approximated by their linearizations throughout training, and requires materializing a matrix, whose size grows with the product of model output dimension and dataset size.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a dual representation of influence functions that extends the kernel trick, allowing computational complexity to scale with dataset size rather than model parameter count. It claims both analytic equivalence to the standard (primal) influence functions under the linearizable-model assumption and experimental validation showing that the dual form efficiently approximates the effects of removing a data point on parameters, model outputs, and loss. The approach is explicitly scoped to linearizable models (those approximable by their linearizations throughout training) and requires materializing a matrix whose size grows with the product of output dimension and dataset size.
Significance. If the claimed analytic equivalence holds without hidden restrictions and the experiments confirm practical gains, the work would meaningfully extend the reach of influence functions to large-scale models where parameter-space computation is prohibitive. It offers a concrete efficiency trade-off (data size vs. model size) that could support broader use in data influence analysis, model debugging, and interpretability for deep networks under the stated assumptions. The explicit scoping to linearizable models and matrix-size caveat is a strength that prevents overclaiming.
Minor comments (3)
- [Abstract] Abstract and introduction: the phrase 'linearizable models' is used without an immediate forward reference to its formal definition (likely in §2 or §3); adding a one-sentence characterization or equation pointer would improve readability for readers scanning the scope.
- [Derivation section] Derivation of the dual form: while the abstract asserts analytic equivalence, the steps linking the standard influence-function Hessian-inverse expression to the dual (kernel) representation should be cross-referenced to the exact equations (e.g., the linearization assumption and the Woodbury or kernel-matrix identity used).
- [Experiments] Experimental section: the reported speed-ups and accuracy comparisons would benefit from explicit statements of the output dimension, dataset sizes, and model architectures tested, plus a brief check that the linearizable assumption held for the chosen networks (e.g., via a linearity-error metric).
Simulated Author's Rebuttal
We thank the referee for their positive summary of our work and for recommending minor revision. The referee accurately captures the paper's core contribution (a dual representation of influence functions whose complexity scales with dataset size) along with its explicit scope to linearizable models and the associated matrix-size requirement. No specific major comments were raised in the report.
Circularity Check
No significant circularity
Full rationale
The derivation obtains the dual representation by extending the kernel trick under the standard linearization assumptions for linearizable models. This yields an analytical equivalence for estimating parameter, output, and loss changes due to data removal, with complexity scaling with dataset size when the model is larger. The claim is explicitly scoped to those models and to the materialization of a matrix whose size grows with the product of output dimension and dataset size; both the analytical step and the experimental validation are independent of any fitted parameter or self-referential definition. No load-bearing self-citation, self-definitional step, or prediction-by-construction is present.
Reference graph
Works this paper leans on
-
[1]
Understanding black-box predictions via influence functions
Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning (ICML), 2017
work page 2017
-
[2]
A survey of machine unlearning
Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning. ACM Transactions on Intelligent Systems and Technology, 2025
work page 2025
-
[3]
Certified data removal from machine learning models
Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens van der Maaten. Certified data removal from machine learning models. In International Conference on Machine Learning (ICML), 2020
work page 2020
-
[4]
Neural tangent kernel: Convergence and generalization in neural networks
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018
work page 2018
-
[5]
Wide neural networks of any depth evolve as linear models under gradient descent
Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), 2019
work page 2019
-
[6]
JAX: composable transformations of Python+NumPy programs
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax
work page 2018
-
[7]
Neural tangents: Fast and easy infinite neural networks in Python
Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in Python. In International Conference on Learning Representations (ICLR), 2020. URL https://github.com/google/neural-tangents
work page 2020
-
[8]
Fast finite width neural tangent kernel
Roman Novak, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Fast finite width neural tangent kernel. In International Conference on Machine Learning (ICML), 2022
work page 2022
-
[9]
Infinite attention: NNGP and NTK for deep attention networks
Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak. Infinite attention: NNGP and NTK for deep attention networks. In International Conference on Machine Learning (ICML), 2020
work page 2020
-
[10]
On the infinite width limit of neural networks with a standard parameterization
Jascha Sohl-Dickstein, Roman Novak, Samuel S Schoenholz, and Jaehoon Lee. On the infinite width limit of neural networks with a standard parameterization. arXiv preprint, 2020
work page 2020
-
[11]
Fast neural kernel embeddings for general activations
Insu Han, Amir Zandieh, Jaehoon Lee, Roman Novak, Lechao Xiao, and Amin Karbasi. Fast neural kernel embeddings for general activations. In Advances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[12]
Gradient-based learning applied to document recognition
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 2002
work page 2002
-
[13]
Learning multiple layers of features from tiny images
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009
work page 2009
-
[14]
Influence functions in deep learning are fragile
S Basu, P Pope, and S Feizi. Influence functions in deep learning are fragile. In International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[15]
Scaling up influence functions
Andrea Schioppa, Polina Zablotskaia, David Vilar, and Artem Sokolov. Scaling up influence functions. In AAAI Conference on Artificial Intelligence, 2022
work page 2022
-
[16]
Studying large language model generalization with influence functions
Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions. arXiv preprint, 2023
work page 2023
-
[17]
Machine unlearning of features and labels
Alexander Warnecke, Lukas Pirch, Christian Wressnegger, and Konrad Rieck. Machine unlearning of features and labels. In Network and Distributed System Security (NDSS) Symposium, 2023
work page 2023
-
[18]
Gradient descent finds global minima of deep neural networks
Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), 2019
work page 2019
-
[19]
A convergence theory for deep learning via over-parameterization
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), 2019
work page 2019