pith. machine review for the scientific record.

arxiv: 2605.10285 · v1 · submitted 2026-05-11 · 📊 stat.ML · cs.LG

Recognition: no theorem link

Scalable Gaussian process inference via neural feature maps

Anthony Stephenson

Pith reviewed 2026-05-12 04:54 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords Gaussian processes · neural feature maps · kernel methods · scalable inference · RKHS approximation · regression · classification

The pith

Neural feature maps let Gaussian processes perform exact inference at scale for regression and classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how neural networks can learn feature maps that define expressive kernels for Gaussian processes. These maps act as optimal low-rank approximations to the Gram matrix of an implied reproducing kernel Hilbert space, which supports consistency of the resulting posterior. The construction allows exact inference to run quickly with little setup and handles both regression and classification on tabular or structured inputs like images. A reader would care because it combines the uncertainty handling of GPs with the representational power of neural networks without needing heavy approximations or custom engineering.
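
To make the mechanics concrete, here is a minimal numpy sketch of how a finite-dimensional feature map turns exact GP regression into O(nm²) linear algebra: the kernel is an inner product of features, so the weight-space (Woodbury) form yields the exact posterior without ever forming the n×n Gram matrix. The tanh network below is an arbitrary stand-in, not the paper's architecture, and its weights are frozen random draws rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(X, W1, W2):
    # Stand-in neural feature map R^d -> R^m; the paper learns these weights,
    # here they are frozen random draws just to exercise the linear algebra.
    return np.tanh(X @ W1) @ W2

n, d, m, noise_var = 2000, 5, 64, 0.1 ** 2
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + np.sqrt(noise_var) * rng.normal(size=n)
W1 = rng.normal(size=(d, 32)) / np.sqrt(d)
W2 = rng.normal(size=(32, m)) / np.sqrt(32)

Phi = phi(X, W1, W2)                        # (n, m); implied kernel K = Phi @ Phi.T

# Weight-space view: a GP with k(x, x') = phi(x) . phi(x') is Bayesian linear
# regression on the features, so exact inference costs O(n m^2) instead of O(n^3).
A = Phi.T @ Phi / noise_var + np.eye(m)     # posterior precision of the weights
L = np.linalg.cholesky(A)
mu_w = np.linalg.solve(L.T, np.linalg.solve(L, Phi.T @ y / noise_var))

X_star = rng.normal(size=(5, d))
Phi_star = phi(X_star, W1, W2)
mean = Phi_star @ mu_w                      # exact posterior mean under the feature-map kernel
V = np.linalg.solve(L, Phi_star.T)          # L^{-1} Phi_star^T
var = np.sum(V * V, axis=0)                 # exact (noise-free) posterior variance
print(mean, var)
```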

Core claim

The learned neural feature map serves as an optimal low-rank approximation to a Gram matrix derived from an implied RKHS, from which consistency of the GP posterior follows. The work further analyses the spectral properties of the induced kernels and introduces product feature-map kernels to address oversmoothing. This enables fast, scalable, and accurate exact GP inference with minimal upfront work across regression, classification, and diverse data modalities.
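
The product feature-map kernel has a simple finite-dimensional reading: a product of inner-product kernels is itself an inner-product kernel whose feature map is the row-wise Kronecker (Khatri-Rao) product of the factors. The sketch below illustrates that standard identity; it is our construction and may not match the paper's exact parameterisation.

```python
import numpy as np

def product_feature_map(phi1, phi2):
    # If k_i(x, x') = phi_i(x) . phi_i(x'), the product kernel k1 * k2 has
    # feature map phi1(x) (x) phi2(x): the row-wise Kronecker product.
    def phi(X):
        F1, F2 = phi1(X), phi2(X)                                    # (n, m1), (n, m2)
        return np.einsum('ni,nj->nij', F1, F2).reshape(len(X), -1)   # (n, m1 * m2)
    return phi

# Sanity check that the Gram matrices agree: Phi_prod Phi_prod^T == K1 * K2 elementwise.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
W_a, W_b = rng.normal(size=(3, 8)), rng.normal(size=(3, 6))
f1 = lambda Z: np.tanh(Z @ W_a)
f2 = lambda Z: np.cos(Z @ W_b)
K1, K2 = f1(X) @ f1(X).T, f2(X) @ f2(X).T
Phi_prod = product_feature_map(f1, f2)(X)
assert np.allclose(K1 * K2, Phi_prod @ Phi_prod.T)
```

The cost of this construction is that the feature dimension multiplies (m1·m2), which is presumably why the paper treats product kernels as a deliberate design choice against oversmoothing rather than a default.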

What carries the argument

Neural feature maps that induce kernels via inner products and act as low-rank approximations to implied RKHS Gram matrices.

Load-bearing premise

The neural network learns a feature map that sufficiently approximates the optimal low-rank structure of the kernel's reproducing kernel Hilbert space.
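
This premise can be probed numerically: by the Eckart-Young theorem, the best rank-m Frobenius-norm approximation of a Gram matrix is its truncated eigendecomposition, so the shortfall of any candidate feature map can be measured against that optimum. A minimal sketch, assuming an RBF Gram matrix as the reference and an untrained stand-in feature map where a trained one would go:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 500, 4, 32
X = rng.normal(size=(n, d))

# Reference Gram matrix: an RBF kernel with unit lengthscale, standing in for
# whatever Gram matrix the implied RKHS produces.
sq = np.sum(X ** 2, 1)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

# Eckart-Young: the best rank-m approximation in Frobenius norm is the
# truncated eigendecomposition of K.
evals, evecs = np.linalg.eigh(K)
top = np.argsort(evals)[::-1][:m]
K_best = (evecs[:, top] * evals[top]) @ evecs[:, top].T
best_err = np.linalg.norm(K - K_best)

# Candidate rank-m Gram approximation from a feature map (untrained here; the
# premise is that training closes most of the remaining gap).
W1 = rng.normal(size=(d, 64)) / np.sqrt(d)
W2 = rng.normal(size=(64, m)) / np.sqrt(64)
Phi = np.tanh(X @ W1) @ W2
fm_err = np.linalg.norm(K - Phi @ Phi.T)

print(f"optimal rank-{m} error: {best_err:.3f}")
print(f"feature-map error:      {fm_err:.3f}")
```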

What would settle it

If the method produces posteriors that diverge from those of an exact GP on a dataset small enough for traditional exact computation, the consistency claim would be challenged.
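
That check is mechanical on small data: compute the posterior once through the naive n×n Gram-matrix route and once through the feature-space route the method relies on, with the same implied kernel, and compare. A hedged sketch with an arbitrary stand-in feature map; the two routes should agree to numerical precision if the exact-inference claim holds for feature-map kernels.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m, noise_var = 200, 3, 16, 0.05

X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d)) + np.sqrt(noise_var) * rng.normal(size=n)
W = rng.normal(size=(d, m)) / np.sqrt(d)
Phi = np.tanh(X @ W)                            # stand-in feature map; kernel K = Phi Phi^T

X_star = rng.normal(size=(20, d))
Phi_star = np.tanh(X_star @ W)

# Route 1: textbook exact GP with the full n x n Gram matrix of the same kernel.
K = Phi @ Phi.T
K_s = Phi_star @ Phi.T
K_noisy_inv = np.linalg.inv(K + noise_var * np.eye(n))
mean_full = K_s @ K_noisy_inv @ y
var_full = np.sum(Phi_star * Phi_star, 1) - np.einsum('ij,jk,ik->i', K_s, K_noisy_inv, K_s)

# Route 2: the O(n m^2) feature-space route the method relies on.
A = Phi.T @ Phi / noise_var + np.eye(m)
mean_feat = Phi_star @ np.linalg.solve(A, Phi.T @ y / noise_var)
var_feat = np.einsum('ij,jk,ik->i', Phi_star, np.linalg.inv(A), Phi_star)

# Agreement to numerical precision is what "exact inference" means here.
print(np.max(np.abs(mean_full - mean_feat)), np.max(np.abs(var_full - var_feat)))
```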

Figures

Figures reproduced from arXiv: 2605.10285 by Anthony Stephenson.

Figure 1. First row: MSE results, training times and prediction times for FM-GP, GPnn, Var and …
Figure 2. (a) shows how the eigenvalues decay for MLP Gram matrices with varying output dimension …
Figure 3. (a) shows how the eigenvalue decay varies for …
Figure 4. (a) shows eigenvalue decay of RBF and Exp kernels, as well as a Nyström …
Figure 5. Architecture of the "off-the-shelf" convolutional neural network.
Figure 6. Posterior distributions from RBF, Exp and FM-GPs conditioned on 2-dimensional …
Figure 7. Posterior distributions from an RBF and FM-GPs conditioned on 2-dimensional observations …
Figure 8. Latent variable points, before being projected and warped into higher dimensions, and the …
Original abstract

We present a theoretically grounded Gaussian process framework that leverages neural feature maps to construct expressive kernels. We show that the learned feature map can be interpreted as an optimal low-rank approximation to a Gram matrix derived from an implied RKHS, from which we establish consistency of the GP posterior. We further analyse the spectral properties of the induced kernels and introduce product feature-map kernels to address oversmoothing. This simple yet powerful approach enables fast, scalable, and accurate exact GP inference with minimal upfront work. The flexibility of kernel design supports seamless application to both regression and classification tasks across diverse data modalities, including tabular inputs and structured domains such as images. On benchmark datasets, this approach surpasses pre-existing methods in terms of accuracy and training and prediction efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes a Gaussian process framework that constructs expressive kernels via neural feature maps. It claims that the learned feature map admits an interpretation as an optimal low-rank approximation to a Gram matrix arising from an implied RKHS, from which posterior consistency is derived. The work further analyzes spectral properties of the induced kernels, introduces product feature-map kernels to counteract oversmoothing, and reports that the resulting exact GP inference is fast, scalable, and accurate on regression and classification benchmarks across tabular and structured data modalities.

Significance. If the central consistency argument can be made rigorous, the approach would offer a principled route to data-driven yet theoretically grounded kernels that support exact GP inference at scale. The combination of neural flexibility with posterior consistency and the proposed product kernels could be useful for practitioners working with non-stationary or high-dimensional data. The empirical claims of improved accuracy and efficiency are potentially valuable, but their weight depends on the resolution of the theoretical gap.

major comments (1)
  1. [Abstract] The claim that the learned neural feature map 'can be interpreted as an optimal low-rank approximation to a Gram matrix derived from an implied RKHS, from which we establish consistency of the GP posterior' is load-bearing. Standard neural-network training (via an ELBO, cross-entropy, or a similar objective) optimizes a different functional from the low-rank Gram approximation whose error controls posterior contraction rates. An explicit equivalence, inequality, or bound linking the training objective to the relevant approximation error must be supplied; without it, the consistency statement does not follow from the low-rank interpretation alone.
minor comments (1)
  1. [Abstract] The statement that the method 'surpasses pre-existing methods in terms of accuracy and training and prediction efficiency' would benefit from naming the specific baselines, datasets, and quantitative metrics in the abstract itself.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We appreciate the recognition of the potential value of neural feature maps for expressive yet consistent Gaussian process inference. The referee's primary concern focuses on the rigor of the consistency claim in the abstract, which we address directly below. We agree that an explicit link is necessary and will strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The claim that the learned neural feature map 'can be interpreted as an optimal low-rank approximation to a Gram matrix derived from an implied RKHS, from which we establish consistency of the GP posterior' is load-bearing. Standard neural-network training (via an ELBO, cross-entropy, or a similar objective) optimizes a different functional from the low-rank Gram approximation whose error controls posterior contraction rates. An explicit equivalence, inequality, or bound linking the training objective to the relevant approximation error must be supplied; without it, the consistency statement does not follow from the low-rank interpretation alone.

    Authors: We acknowledge that the current manuscript does not supply an explicit inequality or bound connecting the neural network training objective (ELBO or cross-entropy) to the low-rank Gram-matrix approximation error that governs posterior contraction. The low-rank interpretation is derived from the representer theorem applied to the implied RKHS, but the optimization path from the training loss to this approximation error is left implicit. In the revised version we will add a new proposition (with proof) that provides a concrete bound: the excess risk of the learned feature map relative to the optimal low-rank approximant is controlled by the training objective plus a term that vanishes under standard assumptions on the neural network class. This will make the consistency argument rigorous and directly address the referee's point. revision: yes
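
For concreteness, the proposition described in this response would presumably take something like the following shape; the notation and the exact form of the right-hand side are ours, not the manuscript's, and are offered only to pin down what "controlled by the training objective plus a vanishing term" would have to mean.

```latex
% Schematic only: one plausible shape for the promised proposition.
% \Phi_\theta : learned feature map;  K : Gram matrix of the implied RKHS kernel;
% K_m : its best rank-m approximation (Eckart--Young);  \mathcal{L} : training objective.
\| K - \Phi_\theta \Phi_\theta^\top \|_F
  \;\le\;
  \underbrace{\| K - K_m \|_F}_{\text{optimal rank-}m\text{ error}}
  \;+\; C \bigl( \mathcal{L}(\theta) - \mathcal{L}^\star \bigr)
  \;+\; \varepsilon_n ,
\qquad \varepsilon_n \xrightarrow{\; n \to \infty \;} 0 .
```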

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent RKHS interpretation

full rationale

The paper claims that the learned neural feature map admits an interpretation as an optimal low-rank Gram approximation in an implied RKHS, from which posterior consistency is established. This step is presented as a theoretical consequence of the feature-map construction and standard RKHS approximation theory rather than a quantity fitted by construction or defined in terms of the target result. No equations in the abstract reduce the consistency claim to the training objective itself, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz is smuggled via prior work. The spectral analysis and product kernels are introduced as additional design choices, not as renamed empirical patterns. The central claim therefore retains independent theoretical content and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; full text would be required to audit the RKHS interpretation, low-rank optimality, and consistency derivation.

pith-pipeline@v0.9.0 · 5406 in / 930 out tokens · 41124 ms · 2026-05-12T04:54:36.022894+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages

  1. [1]

    Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics , pages =

    Variational Learning of Inducing Variables in Sparse Gaussian Processes , author =. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics , pages =. 2009 , editor =

  2. [2]

    Gaussian processes for Big data , year =

    Hensman, James and Fusi, Nicol\`. Gaussian processes for Big data , year =. Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence , pages =

  3. [3]

    32nd International Conference on Machine Learning, ICML 2015 , author =

    Kernel interpolation for scalable structured. 32nd International Conference on Machine Learning, ICML 2015 , author =. 2015 , note =

  4. [4]

    , month = sep, year =

    Gilboa, Elad and Saatçi, Yunus and Cunningham, John P. , month = sep, year =. Scaling

  5. [5]

    and Novikov, Alexander V

    Izmailov, Pavel A. and Novikov, Alexander V. and Kropotov, Dmitry A. , arxivId =. 2018 , booktitle =

  6. [6]

    Neural Computation , author =

    A. Neural Computation , author =. 2000 , pmid =. doi:10.1162/089976600300014908 , abstract =

  7. [7]

    Advances in Neural Information Processing Systems , author =

    Convolutional. Advances in Neural Information Processing Systems , author =. 2017 , note =. doi:10.17863/CAM.21271 , abstract =

  8. [8]

    Kumar, Vinayak and Singh, Vaibhav and Srijith, P. K. and Damianou, Andreas , year =. Deep

  9. [9]

    Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification , url =

    Milios, Dimitrios and Camoriano, Raffaello and Michiardi, Pietro and Rosasco, Lorenzo and Filippone, Maurizio , booktitle =. Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification , url =

  10. [10]

    Proceedings of the 34th International Conference on Machine Learning , pages =

    On Calibration of Modern Neural Networks , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , editor =

  11. [11]

    2018 , eprint=

    Neural Processes , author=. 2018 , eprint=

  12. [12]

    2018 , eprint=

    Deep convolutional Gaussian processes , author=. 2018 , eprint=

  13. [13]

    2016 , eprint=

    Dropout as a Bayesian Approximation: Appendix , author=. 2016 , eprint=

  14. [14]

    2022 , eprint=

    Bayesian Neural Network Priors Revisited , author=. 2022 , eprint=

  15. [15]

    2021 , eprint=

    Mat\'ern Gaussian Processes on Graphs , author=. 2021 , eprint=

  16. [16]

    2015 , editor =

    Hensman, James and Matthews, Alexander and Ghahramani, Zoubin , booktitle =. 2015 , editor =

  17. [17]

    2006 , volume=

    Kim, Hyun-Chul and Ghahramani, Zoubin , journal=. 2006 , volume=. doi:10.1109/TPAMI.2006.238 , url =

  18. [18]

    Leveraging Locality and Robustness to Achieve Massively Scalable Gaussian Process Regression , url =

    Allison, Robert and Stephenson, Anthony and F, Samuel and Pyzer-Knapp, Edward O , booktitle =. Leveraging Locality and Robustness to Achieve Massively Scalable Gaussian Process Regression , url =

  19. [19]

    Calibrated Reliable Regression using Maximum Mean Discrepancy , url =

    Cui, Peng and Hu, Wenbo and Zhu, Jun , booktitle =. Calibrated Reliable Regression using Maximum Mean Discrepancy , url =

  20. [20]

    2019 , journal =

    Garriga-Alonso, Adrià and Aitchison, Laurence and Rasmussen, Carl Edward , pages =. 2019 , journal =. doi:10.17863/CAM.42340 , arxivId =

  21. [21]

    2016 , eprint=

    Manifold Gaussian Processes for Regression , author=. 2016 , eprint=

  22. [22]

    Proceedings of the 24th International Conference on Artificial Intelligence , pages =

    Huang, Wenbing and Zhao, Deli and Sun, Fuchun and Liu, Huaping and Chang, Edward , title =. Proceedings of the 24th International Conference on Artificial Intelligence , pages =. 2015 , isbn =

  23. [23]

    https://stats.stackexchange.com/q/46615 , URL =

    Expected value of a Gaussian random variable transformed with a logistic function , AUTHOR =. https://stats.stackexchange.com/q/46615 , URL =

  24. [24]

    Chapter 4: The Matrix-Variate Gaussian Distribution

    Mathai, Arak and Provost, Serge and Haubold, Hans. Chapter 4: The Matrix-Variate Gaussian Distribution. Multivariate Statistical Analysis in the Real and Complex Domains. 2022. doi:10.1007/978-3-030-95864-0_4

  25. [25]

    Proceedings of the 19th International Conference on Artificial Intelligence and Statistics , pages =

    Deep Kernel Learning , author =. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics , pages =. 2016 , editor =

  26. [26]

    Deep Neural Decision Forests , year=

    Kontschieder, Peter and Fiterau, Madalina and Criminisi, Antonio and Bulò, Samuel Rota , booktitle=. Deep Neural Decision Forests , year=

  27. [27]

    Proceedings of the 36th International Conference on Machine Learning , pages =

    On the Spectral Bias of Neural Networks , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =

  28. [28]

    2021 , journal =

    Michael Kohler, Sophie Langer , pages =. 2021 , journal =

  29. [29]

    Anthony Stephenson , title =

  30. [30]

    , biburl =

    Stein, Michael L. , biburl =. Interpolation of spatial data , url =. doi:10.1007/978-1-4612-1494-6 , interhash =

  31. [31]

    2006 , TITLE =

    Williams, Christopher K and Rasmussen, Carl Edward , PUBLISHER =. 2006 , TITLE =

  32. [32]

    Artificial Intelligence Review , author =

    A survey of uncertainty in deep neural networks , volume =. Artificial Intelligence Review , author =. 2023 , pages =. doi:10.1007/s10462-023-10562-9 , abstract =

  33. [33]

    , booktitle =

    Damianou, Andreas and Lawrence, Neil D. , booktitle =. Deep. 2013 , editor =

  34. [34]

    2020 , eprint=

    Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality , author=. 2020 , eprint=

  35. [35]

    Proceedings of The 24th International Conference on Artificial Intelligence and Statistics , pages =

    A Spectral Analysis of Dot-product Kernels , author =. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics , pages =. 2021 , editor =

  36. [36]

    Journal of Machine Learning Research , year =

    Aad van der Vaart and Harry van Zanten , title =. Journal of Machine Learning Research , year =

  37. [37]

    Advances in neural information processing systems , volume=

    Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration , author=. Advances in neural information processing systems , volume=

  38. [38]

    Advances in Neural Information Processing Systems , volume=

    Exact Gaussian processes on a million data points , author=. Advances in Neural Information Processing Systems , volume=

  39. [39]

    2019 , eprint=

    When Gaussian Process Meets Big Data: A Review of Scalable GPs , author=. 2019 , eprint=

  40. [40]

    Petersen, Kaare Brandt and Pedersen, Michael Syskind , year =. The. doi:10.1007/978-3-030-49840-5_1 , abstract =

  41. [41]

    Forty-second International Conference on Machine Learning , year=

    Scalable Gaussian Processes with Latent Kronecker Structure , author=. Forty-second International Conference on Machine Learning , year=

  42. [42]

    Burt and Carl Edward Rasmussen and Mark van der Wilk , title =

    David R. Burt and Carl Edward Rasmussen and Mark van der Wilk , title =. Journal of Machine Learning Research , year =

  43. [43]

    Journal of Machine Learning Research , year =

    Dennis Nieman and Botond Szabo and Harry van Zanten , title =. Journal of Machine Learning Research , year =

  44. [44]

    2018 , eprint=

    Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences , author=. 2018 , eprint=

  45. [45]

    Fundamentals of Nonparametric Bayesian Inference , publisher=

    Ghosal, Subhashis and van der Vaart, Aad , year=. Fundamentals of Nonparametric Bayesian Inference , publisher=

  46. [46]

    2002 , eprint=

    On choosing and bounding probability metrics , author=. 2002 , eprint=

  47. [47]

    and Johnson, Charles R

    Horn, Roger A. and Johnson, Charles R. , isbn =. 1985 , booktitle =

  48. [48]

    1985 , issn =

    On majorization and Schur products , journal =. 1985 , issn =. doi:https://doi.org/10.1016/0024-3795(85)90147-8 , url =

  49. [49]

    Lin and Allan Pinkus and Shimon Schocken , keywords =

    Moshe Leshno and Vladimir Ya. Lin and Allan Pinkus and Shimon Schocken , keywords =. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function , journal =. 1993 , issn =. doi:https://doi.org/10.1016/S0893-6080(05)80131-5 , url =

  50. [50]

    , title =

    Cybenko, G. , title =. Mathematics of Control, Signals and Systems , year =

  51. [51]

    and Kakade, Sham M

    Seeger, Matthias W. and Kakade, Sham M. and Foster, Dean P. , journal=. Information Consistency of Nonparametric Gaussian Process Methods , year=

  52. [52]

    2021 , eprint=

    MLP-Mixer: An all-MLP Architecture for Vision , author=. 2021 , eprint=

  53. [53]

    and Krzyzak, A

    Kohler, M. and Krzyzak, A. , booktitle=. Adaptive regression estimation with multilayer feedforward neural networks , year=

  54. [54]

    The Annals of Statistics , number =

    Benedikt Bauer and Michael Kohler , title =. The Annals of Statistics , number =. 2019 , doi =

  55. [55]

    Nonparametric regression using deep neural networks with ReLU activation function , volume=

    Schmidt-Hieber, Johannes , year=. Nonparametric regression using deep neural networks with ReLU activation function , volume=. The Annals of Statistics , publisher=. doi:10.1214/19-aos1875 , number=

  56. [56]

    Stochastic Variational Deep Kernel Learning , url =

    Wilson, Andrew G and Hu, Zhiting and Salakhutdinov, Russ R and Xing, Eric P , booktitle =. Stochastic Variational Deep Kernel Learning , url =

  57. [57]

    Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence , pages =

    The promises and pitfalls of deep kernel learning , author =. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence , pages =. 2021 , editor =

  58. [58]

    2022 , eprint=

    Why do tree-based models still outperform deep learning on tabular data? , author=. 2022 , eprint=

  59. [59]

    1985 , booktitle =

    10 Harmonizable, Cramér, and Karhunen classes of processes , series =. 1985 , booktitle =. doi:https://doi.org/10.1016/S0169-7161(85)05012-X , url =