
arxiv: 2606.11255 · v2 · pith:PGVJ7GGHnew · submitted 2026-06-08 · 💻 cs.LG

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Pith reviewed 2026-06-27 17:07 UTC · model grok-4.3

classification 💻 cs.LG
keywords Bernstein-Schur kernels · random features · completely monotone kernels · nonstationary kernels · kernel ridge regression · matrix Bernstein bound · sketching · radial randomization

The pith

Bernstein-Schur kernels admit random features by sketching their modulation and sampling the radial Bernstein-Widder scale before Gaussian Fourier features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Bernstein-Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel, so they sit between the shift-invariant and dot-product families, where standard Bochner or polynomial random features do not apply directly. The paper supplies one construction that sketches the modulation factor to dimension m and draws the radial factor's scale from its one-dimensional Bernstein-Widder representation before applying Gaussian random Fourier features, producing an overall map of size Dm. When the modulation is kept exact, the estimator is unbiased, its variance is given in closed form, and a matrix-Bernstein bound controls the operator norm using leading eigenvalues and an intrinsic dimension. Whitening at the ridge makes the effective dimension d_eff(λ) the exact intrinsic dimension of the matrix variance, so O((1 + ||P||_op/λ) log(d_eff/δ)) radial draws suffice to preserve the kernel-ridge solution, improving to O((1 + d_eff) log(d_eff/δ)) with leverage-tilted draws; the same guarantees carry over to the fully sketched estimator up to one additive sketch term. The flagship example is the biased yat-kernel, whose family span contains the inverse-multiquadric kernel.
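The modulation-sketching step can be pictured with a minimal sketch for the degree-2 modulation (w^T x + b)^2 of the flagship kernel. It assumes b ≥ 0 and uses a Rademacher product sketch, a standard random-feature technique for polynomial kernels; the page does not say which sketch the paper actually uses, and the names `draw_sketch` and `sketch_features` are hypothetical.

```python
import numpy as np

def draw_sketch(d, m, rng):
    # Two independent Rademacher matrices, one per tensor factor (assumed sketch type).
    G = rng.choice([-1.0, 1.0], size=(m, d + 1))
    H = rng.choice([-1.0, 1.0], size=(m, d + 1))
    return G, H

def sketch_features(X, b, G, H):
    # Augment x -> [x, sqrt(b)] so augmented inner products equal x^T y + b (needs b >= 0).
    Xa = np.hstack([X, np.full((X.shape[0], 1), np.sqrt(b))])
    m = G.shape[0]
    # Column j is (g_j^T x~)(h_j^T x~)/sqrt(m). For two points the product of
    # column j has expectation (x^T y + b)^2 / m, so summing over the m columns
    # gives an unbiased estimate of the modulation kernel.
    return (Xa @ G.T) * (Xa @ H.T) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, n = 4, 20000, 5
X = 0.3 * rng.normal(size=(n, d))
b = 1.0

G, H = draw_sketch(d, m, rng)
Z = sketch_features(X, b, G, H)      # shape (n, m); the (d+1)^2 feature is never formed
exact = (X @ X.T + b) ** 2           # exact modulation Gram for comparison
rel_err = np.linalg.norm(Z @ Z.T - exact) / np.linalg.norm(exact)
print(f"relative Frobenius error of the sketched modulation Gram: {rel_err:.3f}")
```

The sketched features have width m regardless of d, which is the sense in which the construction avoids the O(d^2) exact modulation feature; the error of this particular sketch shrinks at the usual Monte-Carlo rate in m.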

Core claim

We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein-Widder scale before applying Gaussian random Fourier features, giving feature dimension Dm, free of the O(d^2) size of the exact modulation feature. With the modulation kept exact (the m → ∞ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude N max_ij route. Whitening this argument at the ridge makes the effective dimension d_eff(λ) the exact intrinsic dime…

What carries the argument

Sketched finite modulation combined with Bernstein-Widder sampling of the radial completely monotone factor, followed by Gaussian random Fourier features.
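A minimal sketch of the radial half of this pipeline, with the modulation kept exact, for the biased yat-kernel (w^T x + b)^2 / (‖w − x‖^2 + ε). It rests on two standard facts: 1/(r^2 + ε) = ∫_0^∞ e^{−εt} e^{−t r^2} dt, so the scale t can be drawn from an exponential law with rate ε and reweighted by 1/ε, and the Gaussian kernel e^{−t r^2} has random Fourier frequencies ω ~ N(0, 2t I). The function names, the cos/sin pairing, and the plain (untilted) draw are assumptions for illustration, not the paper's deployed estimator.

```python
import numpy as np

def yat_kernel(W, X, b, eps):
    # Exact biased yat-kernel Gram between rows of W and rows of X.
    sq = ((W[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return (W @ X.T + b) ** 2 / (sq + eps)

def modulation_map(X, b):
    # phi(x) = vec(x~ x~^T) with x~ = [x, sqrt(b)], so phi(x)^T phi(y) = (x^T y + b)^2.
    Xa = np.hstack([X, np.full((X.shape[0], 1), np.sqrt(b))])
    return (Xa[:, :, None] * Xa[:, None, :]).reshape(X.shape[0], -1)

def draw_frequencies(d, D, eps, rng):
    # Radial scale t ~ Exponential(rate=eps), then omega | t ~ N(0, 2t I).
    t = rng.exponential(scale=1.0 / eps, size=D)
    return rng.normal(size=(D, d)) * np.sqrt(2.0 * t)[:, None]

def radial_rff_features(X, omega, b, eps):
    D = omega.shape[0]
    proj = X @ omega.T
    # The 1/sqrt(eps * D) factor carries the total mass 1/eps and the Monte-Carlo average.
    trig = np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(eps * D)
    phi = modulation_map(X, b)
    # Tensor the radial features with the exact modulation feature.
    return (trig[:, :, None] * phi[:, None, :]).reshape(X.shape[0], -1)

rng = np.random.default_rng(1)
d, D, n = 3, 2000, 6
b, eps = 1.0, 1.0
X = 0.5 * rng.normal(size=(n, d))

omega = draw_frequencies(d, D, eps, rng)
Z = radial_rff_features(X, omega, b, eps)
K = yat_kernel(X, X, b, eps)
rel_err = np.linalg.norm(Z @ Z.T - K) / np.linalg.norm(K)
print(f"relative Frobenius Gram error with D={D}: {rel_err:.3f}")
```

With the exact modulation this map has width 2D(d+1)^2; replacing `modulation_map` with an m-dimensional sketch is what would bring the width down to the order of Dm quoted above.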

If this is right

  • Unbiasedness and an exact variance formula hold when the modulation is kept exact.
  • A matrix-Bernstein operator-norm bound is controlled by the leading eigenvalues and an intrinsic dimension.
  • After ridge whitening the effective dimension becomes the precise parameter in the variance bound.
  • O((1 + d_eff) log(d_eff/δ)) tilted radial draws suffice to preserve the kernel-ridge solution.
  • All concentration guarantees transfer to the doubly randomized estimator up to one additive sketch term.
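To make the effective-dimension count in the bullets above concrete, here is a minimal sketch that computes d_eff(λ) for a yat-kernel Gram matrix. It assumes the common definition d_eff(λ) = tr(K (K + λI)^{-1}); the paper may normalize λ differently (for example by N), and the constant hidden in the O(·) is not known from this page, so `tilted_draw_scale` only reports the scaling term. All function names are hypothetical.

```python
import numpy as np

def yat_gram(X, b, eps):
    # Biased yat-kernel Gram matrix on the rows of X.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return (X @ X.T + b) ** 2 / (sq + eps)

def effective_dimension(K, lam):
    # d_eff(lam) = tr(K (K + lam I)^{-1}) = sum_i s_i / (s_i + lam) over eigenvalues s_i.
    evals = np.clip(np.linalg.eigvalsh(K), 0.0, None)  # clip tiny negative round-off
    return float(np.sum(evals / (evals + lam)))

def tilted_draw_scale(d_eff, delta):
    # Scaling term (1 + d_eff) * log(d_eff / delta); the O(.) constant is omitted.
    return (1.0 + d_eff) * np.log(d_eff / delta)

rng = np.random.default_rng(0)
X = 0.5 * rng.normal(size=(50, 3))
K = yat_gram(X, b=1.0, eps=1.0)

for lam in (10.0, 1.0, 0.1):
    d_eff = effective_dimension(K, lam)
    scale = tilted_draw_scale(d_eff, delta=0.05)
    print(f"lambda={lam:5.1f}  d_eff={d_eff:6.2f}  draw scale={scale:7.1f}")
```

Because each eigenvalue contributes less than one, d_eff(λ) never exceeds N and shrinks as the ridge λ grows, which is why a count driven by d_eff can be far smaller than one driven by N.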

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The construction may apply to other product kernels that possess analogous finite-feature and completely monotone factors.
  • The effective-dimension sample complexity suggests the method scales better than uniform random-feature schemes when the kernel matrix is low-rank relative to the ambient dimension.
  • Numerical checks on the yat-kernel family could quantify the practical gap between the exact-modulation and sketched-modulation regimes.

Load-bearing premise

The kernels belong to the Bernstein-Schur class so that the Bernstein-Widder representation applies to the radial factor and the sketched modulation remains compatible with the subsequent Gaussian Fourier features.

What would settle it

For a concrete Bernstein-Schur kernel such as the biased yat-kernel, compute the empirical variance of many independent realizations of the proposed feature inner products and check whether it equals the exact variance formula stated for the exact-modulation case.
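The shape of such a check can be sketched as follows. Since the paper's variance theorem is not reproduced on this page, the closed form below is derived only for a simplified single-draw estimator, (1/ε)(w^T x + b)^2 cos(ω^T(w − x)) with t ~ Exponential(rate ε) and ω | t ~ N(0, 2t I); it uses E[cos^2(ω^T Δ)] = (1 + ε/(ε + 4‖Δ‖^2))/2. This is an assumption-laden stand-in, and the paper's own formula would replace `closed_form_variance` in a real test.

```python
import numpy as np

def single_draw_estimates(w, x, b, eps, n_rep, rng):
    # n_rep independent one-draw estimates of the biased yat-kernel at (w, x).
    t = rng.exponential(scale=1.0 / eps, size=n_rep)
    omega = rng.normal(size=(n_rep, w.size)) * np.sqrt(2.0 * t)[:, None]
    mod = (w @ x + b) ** 2
    return (mod / eps) * np.cos(omega @ (w - x))

def closed_form_variance(w, x, b, eps):
    # Variance of the simplified one-draw estimator (derived for this sketch only).
    r2 = np.sum((w - x) ** 2)
    mod = (w @ x + b) ** 2
    second_moment = mod**2 * 0.5 * (1.0 + eps / (eps + 4.0 * r2)) / eps**2
    mean = mod / (eps + r2)
    return second_moment - mean**2

rng = np.random.default_rng(0)
w = np.array([0.6, -0.2, 0.3])
x = np.array([0.1, 0.5, -0.4])
b, eps = 1.0, 1.0

samples = single_draw_estimates(w, x, b, eps, n_rep=200_000, rng=rng)
exact = (w @ x + b) ** 2 / (np.sum((w - x) ** 2) + eps)
emp_mean, emp_var = samples.mean(), samples.var()
cf_var = closed_form_variance(w, x, b, eps)
print(f"kernel {exact:.4f}  empirical mean {emp_mean:.4f}")
print(f"closed-form variance {cf_var:.4f}  empirical variance {emp_var:.4f}")
```

Averaging D independent draws divides this variance by D. In this simplified estimator the variance is proportional to (w^T x + b)^4, so the same script can be swept over the alignment to look at the scaling.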

Figures

Figures reproduced from arXiv: 2606.11255 by Taha Bouhsine.

Figure 1: The key regime: an off-sphere bounded ball (varying norms), where k_{ⵟ,b} is genuinely non-dot-product and no direct dot-product reduction is available. (a) RAY's relative Frobenius Gram error follows the O(1/√D) Monte-Carlo rate at every dimension. (b) At D = 1000 RAY stays bounded as d grows, while uniform and k-means Nyström (fixed m = 100 landmarks) degrade (matched in radial/landmark count; the cost-ma…
Figure 2: Operator-norm error of the deployed (doubly-randomized) RAY estimator, validating Theorem 4.4 (off-sphere, d = 16, N = 300, ||P||_op = 186). (a) At fixed sketch size m = 128, the radial term falls as O(1/√D) while the sketch term η||P||_op is a D-independent floor; the total decays to that floor, and the m → ∞ (exact-modulation) curve is the zero-floor limit. (b) The sketch term ||E_P ∘ R||_op and the relative sketc…
Figure 3: RAY as a linear-time, streaming ⵟ-attention primitive (random queries/keys/values, d = 32). (a) The linear-attention output and the induced attention-weight matrix both match exact ⵟ-attention with a median error that falls with the feature dimension M; one fixed map is applied to every token. (b) The one limitation: error scales with attention sharpness: diffuse attention (large radial scale ε) is easy, p…
Figure 4: Sphere-normalized sanity check (here the kernel coincides with a dot-product kernel, so this isolates the dimension behavior and is not a representation claim). RAY approximates the biased Gram at the Monte-Carlo rate with a radial sample count that grows little with dimension (flat D0 = 1). (a) Relative Frobenius error vs. D (N = 1000, b = 1, ε = 1); all dimensions track the O(1/√D) guide within a facto…
Figure 5: Estimator variance vs. the bias-shifted alignment x^⊤w + b (log-log, 2000 repetitions). Both pairs follow a fourth power (fitted slopes 4.01 and 3.99 against the slope-4 guide). For the aligned pair the variance equals the (R^2 + b)^4 envelope of Theorem A.1 (the ratio Var/(R^2 + b)^4 is constant at ≈ 5 × 10^−5, so the bound is tight); the x^⊤w = 0.5 pair lies below it, the gap being the Cauchy–Schwarz step…
Figure 6: Downstream KRR test metric vs. the number of random draws D on sphere-normalized real data (mean over 3 splits, ±1 s.d. bands); the dashed line is the exact ⵟ-kernel. (a) digits: RAY-ⵟ sits at the exact-kernel accuracy already at D = 8, while Gaussian RFF, IMQ-RFF, and Nyström climb slowly and need ∼512 features to catch up. (b) california: RAY tracks the exact kernel from the smallest budgets. RAY keeps …
Figure 7: Cost of fitting ridge regression vs. N (d = 8, D = m = 64, log-log). (a) Wall-clock: exact ridge steepens at the predicted ∼N^2 rate (fitted exponent 2.1) and is run only while feasible; RAY and Nyström grow gently. (b) Representation memory: the exact N × N Gram reaches 33 GB by N = 64,000 (above the dashed cap, where it no longer fits), while RAY (NM) and Nyström (Nm) stay linear in N. Exact ridge scales…
Original abstract

Bernstein--Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels falling between the shift-invariant and dot-product templates random features exploit, so neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein--Widder scale before applying Gaussian random Fourier features, giving feature dimension $Dm$, free of the $O(d^2)$ size of the exact modulation feature. With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude $N\max_{ij}$ route. Whitening this argument at the ridge makes the effective dimension $d_{\mathrm{eff}}(\lambda)$ the \emph{exact} intrinsic dimension of the matrix variance, so $O((1+\|P\|_{\mathrm{op}}/\lambda)\log(d_{\mathrm{eff}}/\delta))$ radial draws preserve the kernel-ridge solution; tilting the draw by a closed-form whitened leverage improves this to the effective-dimension count $O((1+d_{\mathrm{eff}})\log(d_{\mathrm{eff}}/\delta))$. Conditioning on the sketch carries every guarantee to the deployed doubly-randomized estimator up to one additive sketch term, and all hold for the whole class with the modulation Gram in place of the polynomial one. The flagship instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$.
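The closing claim of the abstract can be checked numerically with a minimal sketch. It evaluates the biased yat-kernel from the stated formula and takes a second central difference in b: because (s + b)^2 is quadratic in b, that difference equals 2h^2 exactly, leaving the radial factor 1/(‖w − x‖^2 + ε). Reading this factor as the inverse-multiquadric kernel meant in the abstract, and the choice of a central difference with step h, are assumptions; the function names are hypothetical.

```python
import numpy as np

def yat_kernel(w, x, b, eps):
    # k_{yat,b}(w, x) = (w^T x + b)^2 / (||w - x||^2 + eps), as in the abstract.
    return (w @ x + b) ** 2 / (np.sum((w - x) ** 2) + eps)

def radial_from_b_differences(w, x, b, eps, h=0.5):
    # Second central difference in b, normalized by 2 h^2.
    # The identity is purely algebraic and does not need b - h >= 0.
    second_diff = (
        yat_kernel(w, x, b + h, eps)
        - 2.0 * yat_kernel(w, x, b, eps)
        + yat_kernel(w, x, b - h, eps)
    )
    return second_diff / (2.0 * h**2)

w = np.array([0.6, -0.2, 0.3])
x = np.array([0.1, 0.5, -0.4])
b, eps = 1.0, 0.5

recovered = radial_from_b_differences(w, x, b, eps)
direct = 1.0 / (np.sum((w - x) ** 2) + eps)
print(f"from finite differences: {recovered:.12f}")
print(f"direct radial factor:    {direct:.12f}")
```

The two numbers agree up to floating-point round-off for any step h, since only three members of the family k_{yat,b−h}, k_{yat,b}, k_{yat,b+h} are combined linearly.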

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper defines Bernstein-Schur kernels as products of a finite-feature kernel and a completely monotone shift-invariant kernel. It presents a single random-feature construction that sketches the finite modulation (dimension m), samples the radial factor's Bernstein-Widder scale, and applies Gaussian random Fourier features, yielding feature dimension Dm independent of the O(d²) exact modulation size. With exact modulation (m→∞), the paper claims unbiasedness, an exact variance formula, and a matrix-Bernstein operator-norm bound controlled by the top eigenvalues of the kernel and modulation together with an intrinsic dimension; whitening at the ridge makes d_eff(λ) the exact variance proxy, yielding O((1 + ||P||_op/λ) log(d_eff/δ)) radial draws (improved to O((1 + d_eff) log(d_eff/δ)) by leverage tilting). Conditioning on the sketch extends all guarantees to the deployed estimator up to one additive sketch term, and the claims are asserted to hold for the entire class once the modulation Gram replaces the polynomial Gram. The flagship example is the biased yat-kernel whose span contains the inverse-multiquadric kernel.

Significance. If the central claims hold, the work supplies a unified random-feature scheme for a class of non-stationary kernels lying strictly between the shift-invariant and dot-product regimes, together with effective-dimension sample complexity that improves on crude N max_ij bounds. Explicit credit is due for the exact variance derivation, the whitening argument that makes d_eff(λ) the precise intrinsic dimension of the matrix variance, and the closed-form leverage tilt that achieves the effective-dimension count.

major comments (2)
  1. [Abstract] Abstract and the statement of the main construction: the claim that unbiasedness, exact variance, and the matrix-Bernstein bound carry over to the whole Bernstein-Schur class once the modulation Gram is substituted for the polynomial Gram rests on an unverified compatibility between an arbitrary finite-feature map φ and the subsequent radial-scale sampling; the product structure φ(x)^T φ(y) · ∫ exp(−t‖x−y‖²) dμ(t) does not automatically guarantee that cross terms vanish or that the variance proxy remains the claimed intrinsic dimension for non-polynomial φ.
  2. [Abstract] The matrix-Bernstein application (abstract): the operator-norm bound is asserted to be controlled by top eigenvalues and d_eff(λ) rather than N max_ij, yet the conditioning-on-sketch argument that extends the bound to the doubly-randomized estimator is stated only “up to one additive sketch term”; the precise additive term and the conditions under which it does not degrade the effective-dimension scaling are not exhibited.
minor comments (1)
  1. [Abstract] Notation: the symbol P appearing in the sample-complexity bound O((1 + ||P||_op/λ) …) is not defined in the abstract; its relation to the modulation or kernel operator should be stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the manuscript. We respond point-by-point to the two major comments below.

  1. Referee: [Abstract] Abstract and the statement of the main construction: the claim that unbiasedness, exact variance, and the matrix-Bernstein bound carry over to the whole Bernstein-Schur class once the modulation Gram is substituted for the polynomial Gram rests on an unverified compatibility between an arbitrary finite-feature map φ and the subsequent radial-scale sampling; the product structure φ(x)^T φ(y) · ∫ exp(−t‖x−y‖²) dμ(t) does not automatically guarantee that cross terms vanish or that the variance proxy remains the claimed intrinsic dimension for non-polynomial φ.

    Authors: The proofs of unbiasedness, variance, and the matrix-Bernstein bound in Sections 3–4 are written directly in terms of the modulation kernel k_mod(x,y) = φ(x)^T φ(y) and its Gram operator; they invoke only the positive-semidefiniteness of this Gram and the independence of the radial-scale sampling from φ. Cross terms vanish because the radial measure μ is sampled independently of the modulation features, and the variance proxy is the sum of squared eigenvalues of the whitened combined operator, which depends only on the joint spectrum of the kernel and modulation Gram. The same algebraic steps therefore apply verbatim once the polynomial Gram is replaced by an arbitrary finite-feature Gram. We will insert a short clarifying paragraph in Section 2.2 confirming that no further assumptions on φ are needed. revision: yes

  2. Referee: [Abstract] The matrix-Bernstein application (abstract): the operator-norm bound is asserted to be controlled by top eigenvalues and d_eff(λ) rather than N max_ij, yet the conditioning-on-sketch argument that extends the bound to the doubly-randomized estimator is stated only “up to one additive sketch term”; the precise additive term and the conditions under which it does not degrade the effective-dimension scaling are not exhibited.

    Authors: Theorem 4.5 and the conditioning argument in Section 4.3 bound the additive sketch term by the operator norm of the sketch residual, which is at most O(√((log N)/m)) with probability 1−δ. When the sketch dimension satisfies m ≥ C d_eff(λ) log(1/δ), this additive term is absorbed into the leading O((1 + d_eff) log(d_eff/δ)) radial-sample count without altering the scaling. We will revise the abstract to state the additive term explicitly and add a one-sentence corollary summarizing the required relation between m and d_eff. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivations rely on external matrix-Bernstein inequality and effective-dimension definitions


The paper's unbiasedness, variance, and operator-norm claims are stated to follow from the standard matrix-Bernstein inequality applied after substituting the modulation Gram for the polynomial Gram, together with the external definition of d_eff(λ). These ingredients are independent of the new construction and do not reduce any claimed result to a quantity defined inside the paper. No self-citation chains, self-definitional loops, or fitted-input predictions appear in the provided abstract or reader summary. The substitution step is presented as a direct replacement that preserves the external bounds; whether that substitution is valid is a correctness question, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The construction rests on the Bernstein-Widder representation theorem for completely monotone functions and on the matrix-Bernstein concentration inequality; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Bernstein-Widder representation theorem for completely monotone functions
    Invoked to sample the one-dimensional radial scale of the shift-invariant factor.
  • standard math Matrix Bernstein inequality for operator-norm concentration
    Used to obtain the operator-norm bound controlled by top eigenvalues and intrinsic dimension.

pith-pipeline@v0.9.1-grok · 5874 in / 1505 out tokens · 21317 ms · 2026-06-27T17:07:57.153573+00:00 · methodology

