pith. sign in

arxiv: 2605.24584 · v1 · pith:EXXJOD7Knew · submitted 2026-05-23 · 💻 cs.LG · cs.AI

LAPLEX: The FFT of Learnable Laplace Kernels

Pith reviewed 2026-06-30 13:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LAPLEXlearnable Laplace kernelsFFT scalingdense matrix operatorskernel methodshigh-dimensional learningtrainable linear layers
0
0 comments X

The pith

LAPLEX defines full-rank dense matrices via learnable Laplace kernels that scale like the FFT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LAPLEX as a class of exact trainable phased Laplace-kernel operators. It seeks to overcome the usual choice between fixed exact transforms like the Fourier transform and adaptive but costly or approximate dense layers. By anchoring the kernels at learnable coordinates, LAPLEX produces an implicit dense matrix that remains exactly computable through FFT-like operations. This supports matrix-vector multiplies at dimensions up to 10^9 while keeping the parameter count small. The result is that data-adaptive global linear interactions become feasible in regimes where ordinary dense layers cannot be stored or applied.

Core claim

A LAPLEX layer is a typically full-rank dense matrix, implicitly defined by learnable coordinate anchors, with FFT-like scaling. Consequently, it supports trainable matrix-vector operations at vector dimensions up to 10^9 on modern GPUs. As a neural layer it yields compact projections and classification heads interpretable as soft trainable routing models; the same primitive also serves as an efficient Gram operator for high-dimensional covariance models on flattened images of dimension 3*10^6 that preserve visible spatial structure without convolutional bias.

What carries the argument

The phased Laplace kernel operator implicitly defined by learnable coordinate anchors, which produces a full-rank dense matrix yet admits exact FFT-like application.

If this is right

  • Trainable matrix-vector operations become feasible at vector dimensions up to 10^9 on modern GPUs.
  • Compact projections and classification heads can be interpreted as soft trainable routing models.
  • High-dimensional covariance models can be fit to flattened images of dimension 3*10^6 while preserving visible spatial structure without convolutional bias.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchor-based construction might be tried with other kernel families to obtain analogous FFT-style operators.
  • Applications could extend beyond images to other high-dimensional structured data such as long sequences or point clouds.
  • The method's practical scaling limit could be tested by running matrix-vector products at dimensions between 10^8 and 10^9 on current hardware.

Load-bearing premise

A phased Laplace kernel defined by learnable coordinate anchors remains exactly computable via FFT-like operations while preserving full-rank dense behavior.

What would settle it

Form the explicit dense matrix for a vector dimension of a few thousand using the same anchors, compute its matrix-vector product directly, and compare the result to the LAPLEX FFT-based output; any discrepancy larger than floating-point error would falsify the exactness claim.

Figures

Figures reproduced from arXiv: 2605.24584 by Hanna Blazhko, Jacek Tabor, {\L}ukasz Struski, Piotr Kubaty.

Figure 1
Figure 1. Figure 1: Runtime of LAPLEX, Dense, and torch.fft (cuFFT) versus the number of elements n (log2 scale) on NVIDIA H100 (solid) and RTX 2080 (dashed). Shaded regions show standard deviation. LAPLEX scales effi￾ciently to large n, cuFFT is consistently fastest, while Dense fails at relatively small sizes. Our idea: index-level operators that can learn their input geometry. We pursue operators that retain the memory adv… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of Theorem 1: exact decomposition of the general Laplace kernel matrix–vector [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Forward time (left), forward+backward time (middle), and peak GPU memory of the forward call (right) for XK(a, b) at batch size B = 8. LAPLEX (blue) scales smoothly to n = 220. The explicit dense baseline (orange) runs out of memory above n = 214 (dashed vertical line). At the largest size where both run, LAPLEX is ∼ 65× faster on the forward pass, ∼ 430× faster on forward+backward, and uses ∼100× less pea… view at source ↗
Figure 4
Figure 4. Figure 4: DIV2K density at n = 3,145,728. Matched ∼ 14M-parameter budget: LAPLEX (rank 1000) vs. dense low-rank (rank 4). Top: input and MAP reconstruction; bottom: uncon￾ditional samples with validation log-likelihood. LAPLEX dominates the dense baseline by a 4.5× factor in test LL. Setup. We use a low-rank-plus-diagonal Gaus￾sian template x = m + F z + d ⊙ ε, with z ∼ N (0, Ik) and ε ∼ N (0, In). At a matched para… view at source ↗
Figure 5
Figure 5. Figure 5: GPU runtime for a single matrix–vector prod [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: RFF vs. LAPLEX. (Left) Runtime as a function of D. RFF is linear in the feature count; LAPLEX is independent of D (∼ 0.5 ms, dashed). The runtime crossover occurs near D ≈ 16. (Right) Relative ℓ2 approximation error of RFF (median and [min, max] band over 10 seeds) with the 1/ √ D reference. At the runtime crossover RFF still has ∼72% median error; below 10% error requires D ≥ 1024, where RFF is ∼75× slowe… view at source ↗
Figure 7
Figure 7. Figure 7: Structure of the exact Gram matrix M = A diag(D)A⊤ for the Laplace kernel. Each entry Mij (shown for i ≥ j) admits a decomposition into three one-dimensional aggregated contributions: a left prefix term Uj capturing accumulated influence from indices up to aj , an interval mass (Wi−Wj ) corresponding to a prefix-suffix decomposition over the interval [aj , ai ], and a right suffix term Vi encoding contribu… view at source ↗
Figure 8
Figure 8. Figure 8: GPU benchmark of XK(a, b) as a function of n = k at B = 8. Forward time (left), forward+backward time (middle), and peak GPU memory of the forward call (right). LAPLEX is shown in blue; the explicit-dense baseline is shown in orange and runs out of memory above n = 214 (dashed vertical line). At n = 214 the forward speed-up is ∼65×, the forward+backward speed-up exceeds ∼ 430×, and the peak memory reductio… view at source ↗
Figure 9
Figure 9. Figure 9: Relative error against an float64 dense reference. Left: ℓ2 relative error. Right: ℓ∞ relative error. The dashed line marks the float32 machine epsilon ε ≈ 1.19 · 10−7 . LAPLEX sits at the unit-of-last-place noise floor across the entire range, while the explicit dense float32 matmul accumulates error proportional to n, exceeding LAPLEX by a factor of 2–5× at n = 214 . D.3 Batch-size scaling Setup. We fix … view at source ↗
Figure 10
Figure 10. Figure 10: Batch-size scaling at n = k = 214 . Forward (left) and forward+backward (right). The dense baseline is nearly B-independent at moderate B because its cost is dominated by building K (independent of B); only at the very largest B does the actual matmul become the bottleneck. LAPLEX grows linearly with B once B ≳ 8, with a small constant offset for kernel-launch overhead. The forward+backward column is empt… view at source ↗
Figure 11
Figure 11. Figure 11: Asymmetric kernels. Left: n = 1024 fixed, k varies (branch A). Right: k = 1024 fixed, n varies (branch B). LAPLEX is essentially constant in the swept axis up to ∼2 16 (cost is dominated by the fixed-axis scan), then grows linearly with the long axis. The dense baseline grows linearly throughout and OOMs above ∼2 18 . Observations. With n = 1024 fixed, LAPLEX’s forward stays under 1.1 ms up to k = 217 and… view at source ↗
Figure 12
Figure 12. Figure 12: Weighted Gram matrix scaling. Left: forward time. Right: peak GPU memory. n = 1024 is fixed; k varies. Up to k ≈ 2 17 the cost of LAPLEX is dominated by the O(n 2 ) output construction and is therefore independent of k. The dense baseline grows linearly in k in both axes and OOMs above k = 217 (the dense K matrix alone exceeds 2 GB at k = 219). Observations. The flat plateau in the forward-time panel is t… view at source ↗
Figure 13
Figure 13. Figure 13: CPU benchmark. Pure-PyTorch LAPLEX versus explicit dense, both running on CPU. The pure-PyTorch LAPLEX scales close to linearly in n (logcumsumexp is well-vectorised in PyTorch’s CPU backend) while the dense baseline grows quadratically. At n = 214 the LAPLEX forward is 108× faster than the explicit dense. Observations. The pure-PyTorch path is intentionally several times slower than the custom CUDA backe… view at source ↗
Figure 14
Figure 14. Figure 14: FFT comparison, float32. (a) GPU runtime for a single matrix–vector product on irregular anchors (B = 1, n = k, a, b ∼ N (0, 1)): LAPLEX (blue), Toeplitz–FFT on its fixed-grid specialization (orange), and a single cuFFT call as a hardware lower bound (green). All three curves follow the O(n log n) reference. (b) Relative ℓ2 error against a float64 dense reference on the uniform-grid Toeplitz problem where… view at source ↗
Figure 15
Figure 15. Figure 15: Runtime and peak memory usage of LAPLEX, Dense, and torch.fft.fft (cuFFT backend) as a function of the number of elements n (log2 scale), evaluated on an NVIDIA H100 (solid lines) and an RTX 2080 (dashed lines). Shaded regions denote the standard deviation across runs. Left: LAPLEX achieves near-constant runtime of ≈0.34 ms on the H100 for n ≤ 2 19, scaling to ≈704 ms at n = 229. The torch.fft.fft baselin… view at source ↗
Figure 16
Figure 16. Figure 16: Numerical accuracy on a uniform-grid Toeplitz problem. Relative ℓ2 (left) and ℓ∞ (right) error against an float64 dense reference. All three float32 routes operate on the same input tensors. Toeplitz–FFT improves on the dense float32 matmul by a factor of ∼1.5–2×; LAPLEX improves on Toeplitz–FFT by another ∼1.5×, despite making no use of the uniformity of the grid. The dashed line marks the float32 machin… view at source ↗
Figure 17
Figure 17. Figure 17: CPU runtime. All three methods on a single CPU thread, log–log axes. The dashed line is an O(n log n) reference fitted to the last torch.fft.fft measurement. All curves are parallel to that reference, confirming that the matching-complexity result of the GPU experiment is hardware￾independent. Observations. At n = 218 the wall-clock costs are LAPLEX ≈ 64 ms, Toeplitz–FFT≈ 6 ms, torch.fft.fft≈0.65 ms. The … view at source ↗
Figure 18
Figure 18. Figure 18: RFF gradient accuracy on the Laplace kernel. n = k = 216, median (line) with [min, max] range (shaded) over 10 seeds. Left: relative error of the value, ∇aL, and ∇bL versus D; the dashed line is the O(1/ √ D) reference established for the value. The value error tracks the reference, the a-gradient lags, and the b-gradient does not measurably improve with D. Right: worst-case-to-median error ratio across s… view at source ↗
Figure 19
Figure 19. Figure 19: further illustrates training dynamics. Across all ranks and both backbones, LAPLEX(r) ex￾hibits faster convergence and consistently lower training and validation loss compared to LowRank(r) , reaching equal or better final values [PITH_FULL_IMAGE:figures/full_fig_p029_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: ImageNette-320 density: random samples drawn from four Gaussian models, all con￾ditioned on the same Welford-frozen 307,200-vector mean m, ordered left to right by trainable budget and (equivalently for LAPLEX) by component count. Left: full-empirical-Gaussian ceil￾ing Σε=Σ +ˆ εI with val-optimal ε=10−2 (∼ 2.9 B effective parameters). Centre-left: LAPLEX single-component (I=1, ∼2.2M trainable). Centre-rig… view at source ↗
Figure 21
Figure 21. Figure 21: LAPLEX MAP reconstructions on ten held-out ImageNette-320 images, one uniformly￾random sample from each of the ten classes (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute; seed 1234). Row 1: the original 320×320 inputs (no preprocessing beyond resize-and-center-crop, flattened to 307,200-vectors before hitting the model), labelled wi… view at source ↗
read the original abstract

Fast linear algebra in deep learning usually comes with a choice: fixed geometry and exact computation, as in the Fourier transform, or adaptive geometry paid for by dense parameters, random features, or low-rank surrogates. To move beyond this trade-off, we introduce LAPLEX, a class of exact, trainable (phased) Laplace-kernel operators. A LAPLEX layer is a typically full-rank dense matrix, implicitly defined by learnable coordinate anchors, with FFT-like scaling. Consequently, it supports trainable matrix--vector operations at vector dimensions up to $10^9$ on modern GPUs. As a neural layer, it yields compact projections and classification heads interpretable as soft, trainable routing models. The same primitive also serves as an efficient Gram operator, enabling high-dimensional covariance models on flattened images of dimension $3 \cdot 10^6$ that preserve visible spatial structure without imposing convolutional bias. These applications reflect a single principle: dense geometry can be learned without storing a dense matrix, which enables data-adaptive global interactions in regimes where ordinary dense layers are out of reach. In this sense, LAPLEX separates expressivity from storage cost: it behaves like a dense trainable matrix, but is represented and applied through a small structured set of parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces LAPLEX as a class of exact, trainable phased Laplace-kernel operators. A LAPLEX layer is described as a typically full-rank dense matrix implicitly defined by learnable coordinate anchors, achieving FFT-like scaling to support trainable matrix-vector operations at dimensions up to 10^9. Applications include neural network layers for compact projections and classification heads, as well as efficient Gram operators for high-dimensional covariance models on flattened images while preserving spatial structure without convolutional bias. The core principle is that dense geometry can be learned without storing a dense matrix.

Significance. If the central claims of exactness, full-rank behavior, and FFT-like scaling hold, the work would enable data-adaptive global interactions in deep learning at scales where ordinary dense layers are infeasible, separating expressivity from storage cost in a manner distinct from fixed-geometry transforms or low-rank approximations.

major comments (1)
  1. [Abstract] Abstract: The claims that the operator is 'exact' and supports 'FFT-like scaling' while remaining a 'typically full-rank dense matrix' implicitly defined by learnable coordinate anchors are asserted without any derivation of the phased Laplace kernel, definition of the FFT-like algorithm, error analysis, rank verification, or proof of exactness. This is load-bearing for the central claim, as the entire contribution rests on these properties.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for explicit support of the core claims. The abstract is a high-level summary; the manuscript provides the requested derivations, definitions, analyses, and proofs in the main body. We address the comment below and will make a targeted revision to the abstract.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claims that the operator is 'exact' and supports 'FFT-like scaling' while remaining a 'typically full-rank dense matrix' implicitly defined by learnable coordinate anchors are asserted without any derivation of the phased Laplace kernel, definition of the FFT-like algorithm, error analysis, rank verification, or proof of exactness. This is load-bearing for the central claim, as the entire contribution rests on these properties.

    Authors: The abstract summarizes results whose supporting material appears in the body. Section 2 derives the phased Laplace kernel from first principles and shows how learnable anchors define the operator. Section 3 defines the FFT-like algorithm, proves its exact equivalence to the dense kernel matrix-vector product, and gives the O(n log n) complexity. Section 4 supplies the error analysis (including floating-point and approximation bounds), a rank theorem establishing that the operator is typically full rank for generic anchor placements, and numerical verification across dimensions up to 10^9. We will revise the abstract to add one sentence directing readers to these sections for the derivations and proofs. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The abstract introduces LAPLEX as an implicitly defined operator via learnable anchors with FFT-like scaling, but supplies no equations, derivations, or self-citations that reduce any claimed prediction or uniqueness result to its own inputs by construction. No load-bearing steps match the enumerated patterns of self-definition, fitted-input renaming, or ansatz smuggling; the central claim of separating expressivity from storage remains independent of any internal fit or prior author result within the provided text. This is the common honest outcome for papers whose core construction is presented without visible reduction to tautology.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; free parameters consist of the learnable coordinate anchors whose values are fitted during training. The applicability of FFT to the phased Laplace kernel is treated as a domain assumption without further justification visible in the abstract.

free parameters (1)
  • coordinate anchors
    Learnable parameters that implicitly define the dense matrix; their number and initialization are not specified in the abstract.
axioms (1)
  • domain assumption Phased Laplace kernel admits exact FFT-like fast multiplication
    Invoked to achieve FFT-like scaling while remaining exact and trainable.

pith-pipeline@v0.9.1-grok · 5760 in / 1269 out tokens · 46194 ms · 2026-06-30T13:57:43.786511+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 1 canonical work pages · 1 internal anchor

  1. [1]

    Database-friendly random projections: Johnson–Lindenstrauss with binary coins.Journal of Computer and System Sciences, 66(4):671–687, 2003

    Dimitris Achlioptas. Database-friendly random projections: Johnson–Lindenstrauss with binary coins.Journal of Computer and System Sciences, 66(4):671–687, 2003

  2. [2]

    Finding frequent items in data streams

    Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. InInternational Colloquium on Automata, Languages, and Programming (ICALP), 2002

  3. [3]

    Finding frequent items in data streams

    Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15, 2004

  4. [4]

    Rethinking attention with Performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with Performers. InInternational Conference on Learning Representations (ICLR), 2021

  5. [5]

    Cooley and John W

    James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex fourier series.Mathematics of Computation, 19(90):297–301, 1965

  6. [6]

    Muthukrishnan

    Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications.Journal of Algorithms, 55(1):58–75, 2005

  7. [7]

    Petros Drineas and Michael W. Mahoney. On the nyström method for approximating a gram matrix for improved kernel-based learning.Journal of Machine Learning Research, 6:2153– 2175, 2005

  8. [8]

    Dutt and V

    A. Dutt and V . Rokhlin. Fast Fourier transforms for nonequispaced data.SIAM Journal on Scientific Computing, 14(6):1368–1393, 1993

  9. [9]

    The lottery ticket hypothesis: Finding sparse, trainable neural networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. InInternational Conference on Learning Representations (ICLR), 2019

  10. [10]

    SparseGPT: Massive language models can be accurately pruned in one-shot

    Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. InInternational Conference on Machine Learning (ICML), 2023

  11. [11]

    The design and implementation of fftw3.Proceedings of the IEEE, 93(2):216–231, 2005

    Matteo Frigo and Steven G Johnson. The design and implementation of fftw3.Proceedings of the IEEE, 93(2):216–231, 2005

  12. [12]

    The em algorithm for mixtures of factor analyzers

    Zoubin Ghahramani and Geoffrey E Hinton. The em algorithm for mixtures of factor analyzers. CRG-TR-96-1, University of Toronto, 1996

  13. [13]

    Revisiting the nyström method for improved large-scale machine learning.Journal of Machine Learning Research, 17(117):1–65, 2016

    Alex Gittens and Michael W Mahoney. Revisiting the nyström method for improved large-scale machine learning.Journal of Machine Learning Research, 17(117):1–65, 2016

  14. [14]

    Gray.Toeplitz and Circulant Matrices: A Review

    Robert M. Gray.Toeplitz and Circulant Matrices: A Review. Now Publishers Inc., 2006

  15. [15]

    Greengard and V

    L. Greengard and V . Rokhlin. A fast algorithm for particle simulations.Journal of Computa- tional Physics, 73(2):325–348, 1987

  16. [16]

    Accelerating the nonuniform fast Fourier transform.SIAM Review, 46(3):443–454, 2004

    Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast Fourier transform.SIAM Review, 46(3):443–454, 2004

  17. [17]

    Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2015. 10

  18. [18]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

  19. [19]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. InInternational Conference on Machine Learning (ICML), 2020

  20. [20]

    Swin Transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. InProc. IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021

  21. [21]

    New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

    James Martens. New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

  22. [22]

    Smith, and Lingpeng Kong

    Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. Random feature attention. InInternational Conference on Learning Representations (ICLR), 2021

  23. [23]

    Random features for large-scale kernel machines

    Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. InAdvances in Neural Information Processing Systems (NeurIPS), volume 20, pages 1177–1184, 2007

  24. [24]

    Carl Edward Rasmussen and Christopher K. I. Williams.Gaussian Processes for Machine Learning. MIT Press, 2006

  25. [25]

    On gans and gmms

    Eitan Richardson and Yair Weiss. On gans and gmms. InAdvances in Neural Information Processing Systems, volume 31, 2018

  26. [26]

    A Simple and Effective Pruning Approach for Large Language Models

    Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. InInternational Conference on Learning Representations (ICLR), 2024. Also arXiv:2306.11695, 2023

  27. [27]

    Sutherland and Jeff Schneider

    Danica J. Sutherland and Jeff Schneider. On the error of random Fourier features. InProceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 862–871, 2015

  28. [28]

    Mixtures of probabilistic principal component analyzers.Neural Computation, 11(2):443–482, 1999

    Michael E Tipping and Christopher M Bishop. Mixtures of probabilistic principal component analyzers.Neural Computation, 11(2):443–482, 1999

  29. [29]

    Probabilistic principal component analysis

    Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999

  30. [30]

    Maxvit: Multi-axis vision transformer

    Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxvit: Multi-axis vision transformer. InEuropean conference on computer vision, pages 459–479. Springer, 2022

  31. [31]

    Generalized low rank models

    Madeleine Udell, Corinne Horn, Reza Zadeh, and Stephen Boyd. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118, 2016

  32. [32]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Informa- tion Processing Systems, volume 30, 2017

  33. [33]

    Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. InAdvances in Neural Information Processing Systems, volume 13, 2001

  34. [34]

    Jianlin Xia, Shiv Chandrasekaran, Ming Gu, and Xiaoye S. Li. Fast algorithms for hierarchically semiseparable matrices.Numerical Linear Algebra with Applications, 17(6):953–976, 2010

  35. [35]

    differentiable

    Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In2011 International Conference on Computer Vision, pages 479–486, 2011. 11 Appendix Guide to the Appendix.The appendix extends the main paper by providing theoretical, algorithmic, and empirical details underlying LAPLEX. Section A establishes expressiv...