LAPLEX: The FFT of Learnable Laplace Kernels

Hanna Blazhko; Jacek Tabor; {\L}ukasz Struski; Piotr Kubaty

arxiv: 2605.24584 · v1 · pith:EXXJOD7Knew · submitted 2026-05-23 · 💻 cs.LG · cs.AI

LAPLEX: The FFT of Learnable Laplace Kernels

{\L}ukasz Struski , Hanna Blazhko , Piotr Kubaty , Jacek Tabor This is my paper

Pith reviewed 2026-06-30 13:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords LAPLEXlearnable Laplace kernelsFFT scalingdense matrix operatorskernel methodshigh-dimensional learningtrainable linear layers

0 comments

The pith

LAPLEX defines full-rank dense matrices via learnable Laplace kernels that scale like the FFT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LAPLEX as a class of exact trainable phased Laplace-kernel operators. It seeks to overcome the usual choice between fixed exact transforms like the Fourier transform and adaptive but costly or approximate dense layers. By anchoring the kernels at learnable coordinates, LAPLEX produces an implicit dense matrix that remains exactly computable through FFT-like operations. This supports matrix-vector multiplies at dimensions up to 10^9 while keeping the parameter count small. The result is that data-adaptive global linear interactions become feasible in regimes where ordinary dense layers cannot be stored or applied.

Core claim

A LAPLEX layer is a typically full-rank dense matrix, implicitly defined by learnable coordinate anchors, with FFT-like scaling. Consequently, it supports trainable matrix-vector operations at vector dimensions up to 10^9 on modern GPUs. As a neural layer it yields compact projections and classification heads interpretable as soft trainable routing models; the same primitive also serves as an efficient Gram operator for high-dimensional covariance models on flattened images of dimension 3*10^6 that preserve visible spatial structure without convolutional bias.

What carries the argument

The phased Laplace kernel operator implicitly defined by learnable coordinate anchors, which produces a full-rank dense matrix yet admits exact FFT-like application.

If this is right

Trainable matrix-vector operations become feasible at vector dimensions up to 10^9 on modern GPUs.
Compact projections and classification heads can be interpreted as soft trainable routing models.
High-dimensional covariance models can be fit to flattened images of dimension 3*10^6 while preserving visible spatial structure without convolutional bias.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchor-based construction might be tried with other kernel families to obtain analogous FFT-style operators.
Applications could extend beyond images to other high-dimensional structured data such as long sequences or point clouds.
The method's practical scaling limit could be tested by running matrix-vector products at dimensions between 10^8 and 10^9 on current hardware.

Load-bearing premise

A phased Laplace kernel defined by learnable coordinate anchors remains exactly computable via FFT-like operations while preserving full-rank dense behavior.

What would settle it

Form the explicit dense matrix for a vector dimension of a few thousand using the same anchors, compute its matrix-vector product directly, and compare the result to the LAPLEX FFT-based output; any discrepancy larger than floating-point error would falsify the exactness claim.

Figures

Figures reproduced from arXiv: 2605.24584 by Hanna Blazhko, Jacek Tabor, {\L}ukasz Struski, Piotr Kubaty.

**Figure 1.** Figure 1: Runtime of LAPLEX, Dense, and torch.fft (cuFFT) versus the number of elements n (log2 scale) on NVIDIA H100 (solid) and RTX 2080 (dashed). Shaded regions show standard deviation. LAPLEX scales efficiently to large n, cuFFT is consistently fastest, while Dense fails at relatively small sizes. Our idea: index-level operators that can learn their input geometry. We pursue operators that retain the memory adv… view at source ↗

**Figure 2.** Figure 2: Illustration of Theorem 1: exact decomposition of the general Laplace kernel matrix–vector [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Forward time (left), forward+backward time (middle), and peak GPU memory of the forward call (right) for XK(a, b) at batch size B = 8. LAPLEX (blue) scales smoothly to n = 220. The explicit dense baseline (orange) runs out of memory above n = 214 (dashed vertical line). At the largest size where both run, LAPLEX is ∼ 65× faster on the forward pass, ∼ 430× faster on forward+backward, and uses ∼100× less pea… view at source ↗

**Figure 4.** Figure 4: DIV2K density at n = 3,145,728. Matched ∼ 14M-parameter budget: LAPLEX (rank 1000) vs. dense low-rank (rank 4). Top: input and MAP reconstruction; bottom: unconditional samples with validation log-likelihood. LAPLEX dominates the dense baseline by a 4.5× factor in test LL. Setup. We use a low-rank-plus-diagonal Gaussian template x = m + F z + d ⊙ ε, with z ∼ N (0, Ik) and ε ∼ N (0, In). At a matched para… view at source ↗

**Figure 5.** Figure 5: GPU runtime for a single matrix–vector prod [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: RFF vs. LAPLEX. (Left) Runtime as a function of D. RFF is linear in the feature count; LAPLEX is independent of D (∼ 0.5 ms, dashed). The runtime crossover occurs near D ≈ 16. (Right) Relative ℓ2 approximation error of RFF (median and [min, max] band over 10 seeds) with the 1/ √ D reference. At the runtime crossover RFF still has ∼72% median error; below 10% error requires D ≥ 1024, where RFF is ∼75× slowe… view at source ↗

**Figure 7.** Figure 7: Structure of the exact Gram matrix M = A diag(D)A⊤ for the Laplace kernel. Each entry Mij (shown for i ≥ j) admits a decomposition into three one-dimensional aggregated contributions: a left prefix term Uj capturing accumulated influence from indices up to aj , an interval mass (Wi−Wj ) corresponding to a prefix-suffix decomposition over the interval [aj , ai ], and a right suffix term Vi encoding contribu… view at source ↗

**Figure 8.** Figure 8: GPU benchmark of XK(a, b) as a function of n = k at B = 8. Forward time (left), forward+backward time (middle), and peak GPU memory of the forward call (right). LAPLEX is shown in blue; the explicit-dense baseline is shown in orange and runs out of memory above n = 214 (dashed vertical line). At n = 214 the forward speed-up is ∼65×, the forward+backward speed-up exceeds ∼ 430×, and the peak memory reductio… view at source ↗

**Figure 9.** Figure 9: Relative error against an float64 dense reference. Left: ℓ2 relative error. Right: ℓ∞ relative error. The dashed line marks the float32 machine epsilon ε ≈ 1.19 · 10−7 . LAPLEX sits at the unit-of-last-place noise floor across the entire range, while the explicit dense float32 matmul accumulates error proportional to n, exceeding LAPLEX by a factor of 2–5× at n = 214 . D.3 Batch-size scaling Setup. We fix … view at source ↗

**Figure 10.** Figure 10: Batch-size scaling at n = k = 214 . Forward (left) and forward+backward (right). The dense baseline is nearly B-independent at moderate B because its cost is dominated by building K (independent of B); only at the very largest B does the actual matmul become the bottleneck. LAPLEX grows linearly with B once B ≳ 8, with a small constant offset for kernel-launch overhead. The forward+backward column is empt… view at source ↗

**Figure 11.** Figure 11: Asymmetric kernels. Left: n = 1024 fixed, k varies (branch A). Right: k = 1024 fixed, n varies (branch B). LAPLEX is essentially constant in the swept axis up to ∼2 16 (cost is dominated by the fixed-axis scan), then grows linearly with the long axis. The dense baseline grows linearly throughout and OOMs above ∼2 18 . Observations. With n = 1024 fixed, LAPLEX’s forward stays under 1.1 ms up to k = 217 and… view at source ↗

**Figure 12.** Figure 12: Weighted Gram matrix scaling. Left: forward time. Right: peak GPU memory. n = 1024 is fixed; k varies. Up to k ≈ 2 17 the cost of LAPLEX is dominated by the O(n 2 ) output construction and is therefore independent of k. The dense baseline grows linearly in k in both axes and OOMs above k = 217 (the dense K matrix alone exceeds 2 GB at k = 219). Observations. The flat plateau in the forward-time panel is t… view at source ↗

**Figure 13.** Figure 13: CPU benchmark. Pure-PyTorch LAPLEX versus explicit dense, both running on CPU. The pure-PyTorch LAPLEX scales close to linearly in n (logcumsumexp is well-vectorised in PyTorch’s CPU backend) while the dense baseline grows quadratically. At n = 214 the LAPLEX forward is 108× faster than the explicit dense. Observations. The pure-PyTorch path is intentionally several times slower than the custom CUDA backe… view at source ↗

**Figure 14.** Figure 14: FFT comparison, float32. (a) GPU runtime for a single matrix–vector product on irregular anchors (B = 1, n = k, a, b ∼ N (0, 1)): LAPLEX (blue), Toeplitz–FFT on its fixed-grid specialization (orange), and a single cuFFT call as a hardware lower bound (green). All three curves follow the O(n log n) reference. (b) Relative ℓ2 error against a float64 dense reference on the uniform-grid Toeplitz problem where… view at source ↗

**Figure 15.** Figure 15: Runtime and peak memory usage of LAPLEX, Dense, and torch.fft.fft (cuFFT backend) as a function of the number of elements n (log2 scale), evaluated on an NVIDIA H100 (solid lines) and an RTX 2080 (dashed lines). Shaded regions denote the standard deviation across runs. Left: LAPLEX achieves near-constant runtime of ≈0.34 ms on the H100 for n ≤ 2 19, scaling to ≈704 ms at n = 229. The torch.fft.fft baselin… view at source ↗

**Figure 16.** Figure 16: Numerical accuracy on a uniform-grid Toeplitz problem. Relative ℓ2 (left) and ℓ∞ (right) error against an float64 dense reference. All three float32 routes operate on the same input tensors. Toeplitz–FFT improves on the dense float32 matmul by a factor of ∼1.5–2×; LAPLEX improves on Toeplitz–FFT by another ∼1.5×, despite making no use of the uniformity of the grid. The dashed line marks the float32 machin… view at source ↗

**Figure 17.** Figure 17: CPU runtime. All three methods on a single CPU thread, log–log axes. The dashed line is an O(n log n) reference fitted to the last torch.fft.fft measurement. All curves are parallel to that reference, confirming that the matching-complexity result of the GPU experiment is hardwareindependent. Observations. At n = 218 the wall-clock costs are LAPLEX ≈ 64 ms, Toeplitz–FFT≈ 6 ms, torch.fft.fft≈0.65 ms. The … view at source ↗

**Figure 18.** Figure 18: RFF gradient accuracy on the Laplace kernel. n = k = 216, median (line) with [min, max] range (shaded) over 10 seeds. Left: relative error of the value, ∇aL, and ∇bL versus D; the dashed line is the O(1/ √ D) reference established for the value. The value error tracks the reference, the a-gradient lags, and the b-gradient does not measurably improve with D. Right: worst-case-to-median error ratio across s… view at source ↗

**Figure 19.** Figure 19: further illustrates training dynamics. Across all ranks and both backbones, LAPLEX(r) exhibits faster convergence and consistently lower training and validation loss compared to LowRank(r) , reaching equal or better final values [PITH_FULL_IMAGE:figures/full_fig_p029_19.png] view at source ↗

**Figure 20.** Figure 20: ImageNette-320 density: random samples drawn from four Gaussian models, all conditioned on the same Welford-frozen 307,200-vector mean m, ordered left to right by trainable budget and (equivalently for LAPLEX) by component count. Left: full-empirical-Gaussian ceiling Σε=Σ +ˆ εI with val-optimal ε=10−2 (∼ 2.9 B effective parameters). Centre-left: LAPLEX single-component (I=1, ∼2.2M trainable). Centre-rig… view at source ↗

**Figure 21.** Figure 21: LAPLEX MAP reconstructions on ten held-out ImageNette-320 images, one uniformlyrandom sample from each of the ten classes (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute; seed 1234). Row 1: the original 320×320 inputs (no preprocessing beyond resize-and-center-crop, flattened to 307,200-vectors before hitting the model), labelled wi… view at source ↗

read the original abstract

Fast linear algebra in deep learning usually comes with a choice: fixed geometry and exact computation, as in the Fourier transform, or adaptive geometry paid for by dense parameters, random features, or low-rank surrogates. To move beyond this trade-off, we introduce LAPLEX, a class of exact, trainable (phased) Laplace-kernel operators. A LAPLEX layer is a typically full-rank dense matrix, implicitly defined by learnable coordinate anchors, with FFT-like scaling. Consequently, it supports trainable matrix--vector operations at vector dimensions up to $10^9$ on modern GPUs. As a neural layer, it yields compact projections and classification heads interpretable as soft, trainable routing models. The same primitive also serves as an efficient Gram operator, enabling high-dimensional covariance models on flattened images of dimension $3 \cdot 10^6$ that preserve visible spatial structure without imposing convolutional bias. These applications reflect a single principle: dense geometry can be learned without storing a dense matrix, which enables data-adaptive global interactions in regimes where ordinary dense layers are out of reach. In this sense, LAPLEX separates expressivity from storage cost: it behaves like a dense trainable matrix, but is represented and applied through a small structured set of parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LAPLEX claims exact trainable full-rank matrices at 10^9 scale via learnable Laplace kernels and FFT-like ops, but the abstract gives no derivation so the central assumption stays untested.

read the letter

The main takeaway is that this paper puts forward LAPLEX as a structured operator that acts like a dense trainable matrix but uses a small number of learnable coordinate anchors to define a phased Laplace kernel, then applies it at FFT-like cost. That lets it handle matrix-vector products on vectors up to a billion entries without storing the matrix.

What stands out as new is the specific choice to make the Laplace kernel phased and anchor-based so the parameters stay few while the effective interaction stays full-rank and data-adaptive. The applications sketched—compact classification heads that look like soft routing, and Gram matrices on flattened 3-million-dimensional images that keep spatial structure—are direct consequences of that construction. The framing around escaping the fixed-geometry versus dense-parameter trade-off is clean and the scaling numbers are stated plainly.

The soft spot is the lack of any derivation or error bound in the abstract for why the phased kernel remains exactly FFT-computable once the anchors are learned. The claim that full rank is preserved while the computation stays exact rests on an assumption that is not shown here; without the equations it is impossible to check whether the FFT step introduces hidden approximations or rank collapse. The stress-test note correctly flags that the full manuscript would be needed to inspect the operator definition and any verification.

This is aimed at researchers building large-scale layers for images or sequences who already work with structured matrices or kernel approximations. A reader who cares about exact versus approximate fast linear algebra would find the concrete scaling claims useful to examine. The work shows clear thinking about the memory-expressivity tension and engages the relevant literature on FFT and kernels, so it is worth sending to referees even if the math details require tightening.

Referee Report

1 major / 0 minor

Summary. The paper introduces LAPLEX as a class of exact, trainable phased Laplace-kernel operators. A LAPLEX layer is described as a typically full-rank dense matrix implicitly defined by learnable coordinate anchors, achieving FFT-like scaling to support trainable matrix-vector operations at dimensions up to 10^9. Applications include neural network layers for compact projections and classification heads, as well as efficient Gram operators for high-dimensional covariance models on flattened images while preserving spatial structure without convolutional bias. The core principle is that dense geometry can be learned without storing a dense matrix.

Significance. If the central claims of exactness, full-rank behavior, and FFT-like scaling hold, the work would enable data-adaptive global interactions in deep learning at scales where ordinary dense layers are infeasible, separating expressivity from storage cost in a manner distinct from fixed-geometry transforms or low-rank approximations.

major comments (1)

[Abstract] Abstract: The claims that the operator is 'exact' and supports 'FFT-like scaling' while remaining a 'typically full-rank dense matrix' implicitly defined by learnable coordinate anchors are asserted without any derivation of the phased Laplace kernel, definition of the FFT-like algorithm, error analysis, rank verification, or proof of exactness. This is load-bearing for the central claim, as the entire contribution rests on these properties.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for explicit support of the core claims. The abstract is a high-level summary; the manuscript provides the requested derivations, definitions, analyses, and proofs in the main body. We address the comment below and will make a targeted revision to the abstract.

read point-by-point responses

Referee: [Abstract] Abstract: The claims that the operator is 'exact' and supports 'FFT-like scaling' while remaining a 'typically full-rank dense matrix' implicitly defined by learnable coordinate anchors are asserted without any derivation of the phased Laplace kernel, definition of the FFT-like algorithm, error analysis, rank verification, or proof of exactness. This is load-bearing for the central claim, as the entire contribution rests on these properties.

Authors: The abstract summarizes results whose supporting material appears in the body. Section 2 derives the phased Laplace kernel from first principles and shows how learnable anchors define the operator. Section 3 defines the FFT-like algorithm, proves its exact equivalence to the dense kernel matrix-vector product, and gives the O(n log n) complexity. Section 4 supplies the error analysis (including floating-point and approximation bounds), a rank theorem establishing that the operator is typically full rank for generic anchor placements, and numerical verification across dimensions up to 10^9. We will revise the abstract to add one sentence directing readers to these sections for the derivations and proofs. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The abstract introduces LAPLEX as an implicitly defined operator via learnable anchors with FFT-like scaling, but supplies no equations, derivations, or self-citations that reduce any claimed prediction or uniqueness result to its own inputs by construction. No load-bearing steps match the enumerated patterns of self-definition, fitted-input renaming, or ansatz smuggling; the central claim of separating expressivity from storage remains independent of any internal fit or prior author result within the provided text. This is the common honest outcome for papers whose core construction is presented without visible reduction to tautology.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; free parameters consist of the learnable coordinate anchors whose values are fitted during training. The applicability of FFT to the phased Laplace kernel is treated as a domain assumption without further justification visible in the abstract.

free parameters (1)

coordinate anchors
Learnable parameters that implicitly define the dense matrix; their number and initialization are not specified in the abstract.

axioms (1)

domain assumption Phased Laplace kernel admits exact FFT-like fast multiplication
Invoked to achieve FFT-like scaling while remaining exact and trainable.

pith-pipeline@v0.9.1-grok · 5760 in / 1269 out tokens · 46194 ms · 2026-06-30T13:57:43.786511+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Database-friendly random projections: Johnson–Lindenstrauss with binary coins.Journal of Computer and System Sciences, 66(4):671–687, 2003

Dimitris Achlioptas. Database-friendly random projections: Johnson–Lindenstrauss with binary coins.Journal of Computer and System Sciences, 66(4):671–687, 2003

2003
[2]

Finding frequent items in data streams

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. InInternational Colloquium on Automata, Languages, and Programming (ICALP), 2002

2002
[3]

Finding frequent items in data streams

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15, 2004

2004
[4]

Rethinking attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with Performers. InInternational Conference on Learning Representations (ICLR), 2021

2021
[5]

Cooley and John W

James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex fourier series.Mathematics of Computation, 19(90):297–301, 1965

1965
[6]

Muthukrishnan

Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications.Journal of Algorithms, 55(1):58–75, 2005

2005
[7]

Petros Drineas and Michael W. Mahoney. On the nyström method for approximating a gram matrix for improved kernel-based learning.Journal of Machine Learning Research, 6:2153– 2175, 2005

2005
[8]

Dutt and V

A. Dutt and V . Rokhlin. Fast Fourier transforms for nonequispaced data.SIAM Journal on Scientific Computing, 14(6):1368–1393, 1993

1993
[9]

The lottery ticket hypothesis: Finding sparse, trainable neural networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. InInternational Conference on Learning Representations (ICLR), 2019

2019
[10]

SparseGPT: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. InInternational Conference on Machine Learning (ICML), 2023

2023
[11]

The design and implementation of fftw3.Proceedings of the IEEE, 93(2):216–231, 2005

Matteo Frigo and Steven G Johnson. The design and implementation of fftw3.Proceedings of the IEEE, 93(2):216–231, 2005

2005
[12]

The em algorithm for mixtures of factor analyzers

Zoubin Ghahramani and Geoffrey E Hinton. The em algorithm for mixtures of factor analyzers. CRG-TR-96-1, University of Toronto, 1996

1996
[13]

Revisiting the nyström method for improved large-scale machine learning.Journal of Machine Learning Research, 17(117):1–65, 2016

Alex Gittens and Michael W Mahoney. Revisiting the nyström method for improved large-scale machine learning.Journal of Machine Learning Research, 17(117):1–65, 2016

2016
[14]

Gray.Toeplitz and Circulant Matrices: A Review

Robert M. Gray.Toeplitz and Circulant Matrices: A Review. Now Publishers Inc., 2006

2006
[15]

Greengard and V

L. Greengard and V . Rokhlin. A fast algorithm for particle simulations.Journal of Computa- tional Physics, 73(2):325–348, 1987

1987
[16]

Accelerating the nonuniform fast Fourier transform.SIAM Review, 46(3):443–454, 2004

Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast Fourier transform.SIAM Review, 46(3):443–454, 2004

2004
[17]

Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2015. 10

2015
[18]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

2022
[19]

Transformers are RNNs: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. InInternational Conference on Machine Learning (ICML), 2020

2020
[20]

Swin Transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. InProc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021

2021
[21]

New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

James Martens. New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

2020
[22]

Smith, and Lingpeng Kong

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. Random feature attention. InInternational Conference on Learning Representations (ICLR), 2021

2021
[23]

Random features for large-scale kernel machines

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. InAdvances in Neural Information Processing Systems (NeurIPS), volume 20, pages 1177–1184, 2007

2007
[24]

Carl Edward Rasmussen and Christopher K. I. Williams.Gaussian Processes for Machine Learning. MIT Press, 2006

2006
[25]

On gans and gmms

Eitan Richardson and Yair Weiss. On gans and gmms. InAdvances in Neural Information Processing Systems, volume 31, 2018

2018
[26]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. InInternational Conference on Learning Representations (ICLR), 2024. Also arXiv:2306.11695, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Sutherland and Jeff Schneider

Danica J. Sutherland and Jeff Schneider. On the error of random Fourier features. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 862–871, 2015

2015
[28]

Mixtures of probabilistic principal component analyzers.Neural Computation, 11(2):443–482, 1999

Michael E Tipping and Christopher M Bishop. Mixtures of probabilistic principal component analyzers.Neural Computation, 11(2):443–482, 1999

1999
[29]

Probabilistic principal component analysis

Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999

1999
[30]

Maxvit: Multi-axis vision transformer

Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxvit: Multi-axis vision transformer. InEuropean conference on computer vision, pages 459–479. Springer, 2022

2022
[31]

Generalized low rank models

Madeleine Udell, Corinne Horn, Reza Zadeh, and Stephen Boyd. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118, 2016

2016
[32]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Informa- tion Processing Systems, volume 30, 2017

2017
[33]

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. InAdvances in Neural Information Processing Systems, volume 13, 2001

2001
[34]

Jianlin Xia, Shiv Chandrasekaran, Ming Gu, and Xiaoye S. Li. Fast algorithms for hierarchically semiseparable matrices.Numerical Linear Algebra with Applications, 17(6):953–976, 2010

2010
[35]

differentiable

Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In2011 International Conference on Computer Vision, pages 479–486, 2011. 11 Appendix Guide to the Appendix.The appendix extends the main paper by providing theoretical, algorithmic, and empirical details underlying LAPLEX. Section A establishes expressiv...

2011

[1] [1]

Database-friendly random projections: Johnson–Lindenstrauss with binary coins.Journal of Computer and System Sciences, 66(4):671–687, 2003

Dimitris Achlioptas. Database-friendly random projections: Johnson–Lindenstrauss with binary coins.Journal of Computer and System Sciences, 66(4):671–687, 2003

2003

[2] [2]

Finding frequent items in data streams

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. InInternational Colloquium on Automata, Languages, and Programming (ICALP), 2002

2002

[3] [3]

Finding frequent items in data streams

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15, 2004

2004

[4] [4]

Rethinking attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with Performers. InInternational Conference on Learning Representations (ICLR), 2021

2021

[5] [5]

Cooley and John W

James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex fourier series.Mathematics of Computation, 19(90):297–301, 1965

1965

[6] [6]

Muthukrishnan

Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications.Journal of Algorithms, 55(1):58–75, 2005

2005

[7] [7]

Petros Drineas and Michael W. Mahoney. On the nyström method for approximating a gram matrix for improved kernel-based learning.Journal of Machine Learning Research, 6:2153– 2175, 2005

2005

[8] [8]

Dutt and V

A. Dutt and V . Rokhlin. Fast Fourier transforms for nonequispaced data.SIAM Journal on Scientific Computing, 14(6):1368–1393, 1993

1993

[9] [9]

The lottery ticket hypothesis: Finding sparse, trainable neural networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. InInternational Conference on Learning Representations (ICLR), 2019

2019

[10] [10]

SparseGPT: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. InInternational Conference on Machine Learning (ICML), 2023

2023

[11] [11]

The design and implementation of fftw3.Proceedings of the IEEE, 93(2):216–231, 2005

Matteo Frigo and Steven G Johnson. The design and implementation of fftw3.Proceedings of the IEEE, 93(2):216–231, 2005

2005

[12] [12]

The em algorithm for mixtures of factor analyzers

Zoubin Ghahramani and Geoffrey E Hinton. The em algorithm for mixtures of factor analyzers. CRG-TR-96-1, University of Toronto, 1996

1996

[13] [13]

Revisiting the nyström method for improved large-scale machine learning.Journal of Machine Learning Research, 17(117):1–65, 2016

Alex Gittens and Michael W Mahoney. Revisiting the nyström method for improved large-scale machine learning.Journal of Machine Learning Research, 17(117):1–65, 2016

2016

[14] [14]

Gray.Toeplitz and Circulant Matrices: A Review

Robert M. Gray.Toeplitz and Circulant Matrices: A Review. Now Publishers Inc., 2006

2006

[15] [15]

Greengard and V

L. Greengard and V . Rokhlin. A fast algorithm for particle simulations.Journal of Computa- tional Physics, 73(2):325–348, 1987

1987

[16] [16]

Accelerating the nonuniform fast Fourier transform.SIAM Review, 46(3):443–454, 2004

Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast Fourier transform.SIAM Review, 46(3):443–454, 2004

2004

[17] [17]

Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2015. 10

2015

[18] [18]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

2022

[19] [19]

Transformers are RNNs: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. InInternational Conference on Machine Learning (ICML), 2020

2020

[20] [20]

Swin Transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. InProc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021

2021

[21] [21]

New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

James Martens. New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

2020

[22] [22]

Smith, and Lingpeng Kong

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. Random feature attention. InInternational Conference on Learning Representations (ICLR), 2021

2021

[23] [23]

Random features for large-scale kernel machines

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. InAdvances in Neural Information Processing Systems (NeurIPS), volume 20, pages 1177–1184, 2007

2007

[24] [24]

Carl Edward Rasmussen and Christopher K. I. Williams.Gaussian Processes for Machine Learning. MIT Press, 2006

2006

[25] [25]

On gans and gmms

Eitan Richardson and Yair Weiss. On gans and gmms. InAdvances in Neural Information Processing Systems, volume 31, 2018

2018

[26] [26]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. InInternational Conference on Learning Representations (ICLR), 2024. Also arXiv:2306.11695, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Sutherland and Jeff Schneider

Danica J. Sutherland and Jeff Schneider. On the error of random Fourier features. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 862–871, 2015

2015

[28] [28]

Mixtures of probabilistic principal component analyzers.Neural Computation, 11(2):443–482, 1999

Michael E Tipping and Christopher M Bishop. Mixtures of probabilistic principal component analyzers.Neural Computation, 11(2):443–482, 1999

1999

[29] [29]

Probabilistic principal component analysis

Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999

1999

[30] [30]

Maxvit: Multi-axis vision transformer

Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxvit: Multi-axis vision transformer. InEuropean conference on computer vision, pages 459–479. Springer, 2022

2022

[31] [31]

Generalized low rank models

Madeleine Udell, Corinne Horn, Reza Zadeh, and Stephen Boyd. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118, 2016

2016

[32] [32]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Informa- tion Processing Systems, volume 30, 2017

2017

[33] [33]

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. InAdvances in Neural Information Processing Systems, volume 13, 2001

2001

[34] [34]

Jianlin Xia, Shiv Chandrasekaran, Ming Gu, and Xiaoye S. Li. Fast algorithms for hierarchically semiseparable matrices.Numerical Linear Algebra with Applications, 17(6):953–976, 2010

2010

[35] [35]

differentiable

Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In2011 International Conference on Computer Vision, pages 479–486, 2011. 11 Appendix Guide to the Appendix.The appendix extends the main paper by providing theoretical, algorithmic, and empirical details underlying LAPLEX. Section A establishes expressiv...

2011