
arxiv: 2605.20635 · v1 · pith:QERS5T3Unew · submitted 2026-05-20 · 💻 cs.LG · math.ST · stat.ML · stat.TH

The General Theory of Localization Methods

Pith reviewed 2026-05-21 06:30 UTC · model grok-4.3

classification 💻 cs.LG · math.ST · stat.ML · stat.TH
keywords localization method · self-attention · Transformer · kernel methods · Hopfield networks · unified framework · machine learning models · hierarchical models

The pith

The localization method unifies many machine learning models and reconstructs the Transformer from hierarchical local models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the localization method as a general framework for machine learning built on localization kernels and local means that support self-attention. It defines the framework through the formulation of the local model and the localization trick to connect rigorously to kernel methods, lazy learning, MeanShift, relaxation labeling, Hopfield networks, local linear embedding, fuzzy inference, and denoising autoencoders. The central result shows that Transformers can be assembled from hierarchical local models, which positions the method as a way to reinterpret existing techniques and to generate new adaptive architectures.

Core claim

Defining the local model and applying the localization trick yields a framework that reinterprets kernel methods, lazy learning, MeanShift, relaxation labeling, Hopfield networks, local linear embedding, fuzzy inference, and denoising autoencoders as instances of localization while extending the same structure to adaptive kernels, hierarchical local models, and the construction of Transformers.

What carries the argument

The localization trick, which weights local means by localization kernels to produce localized models that generalize self-attention.
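
As a rough illustration of this idea, the sketch below computes a kernel-weighted local mean and checks that, with an exponential dot-product kernel, it coincides with softmax attention for a single query. This is a minimal sketch under stated assumptions: the function names (`local_mean`, `gaussian_kernel`, `exp_dot_kernel`, `softmax_attention`) and both kernel choices are hypothetical and are not taken from the paper, whose exact definitions are not reproduced on this page.

```python
import numpy as np

def local_mean(query, keys, values, kernel):
    """Kernel-weighted average of `values`, localized around `query`.

    Hypothetical helper: a generic normalized kernel average, not the
    paper's own definition of a local mean.
    """
    weights = np.array([kernel(query, k) for k in keys])  # non-negative scores
    weights = weights / weights.sum()                      # normalize to sum to 1
    return weights @ values

def gaussian_kernel(q, k, bandwidth=1.0):
    # Assumed example kernel: isotropic Gaussian on Euclidean distance.
    return np.exp(-np.sum((q - k) ** 2) / (2.0 * bandwidth ** 2))

def exp_dot_kernel(q, k):
    # Assumed example kernel: exponential of the scaled dot product.
    return np.exp(q @ k / np.sqrt(q.shape[0]))

def softmax_attention(query, keys, values):
    # Standard scaled dot-product attention for one query vector.
    scores = keys @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()          # shift for numerical stability
    w = np.exp(scores)
    return (w / w.sum()) @ values

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))
values = rng.normal(size=(5, 3))
query = rng.normal(size=4)

attn_out = softmax_attention(query, keys, values)
local_out = local_mean(query, keys, values, exp_dot_kernel)   # same weights as softmax
smoothed = local_mean(query, keys, values, gaussian_kernel)   # distance-based weights
```

Swapping `exp_dot_kernel` for `gaussian_kernel` changes only how the weights are computed, which is the sense in which a choice of kernel selects a particular localized model in this sketch.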

If this is right

  • Hopfield networks and denoising autoencoders become special cases of localized models.
  • Transformers arise directly from stacking hierarchical local models.
  • Adaptive kernels and non-local extensions fit inside the same formal structure.
  • New data-adaptive systems can be designed by varying the choice of localization kernel.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-means construction might extend naturally to graph-structured or spatial data beyond sequences.
  • Focusing computation on local kernels could suggest efficiency gains for attention-based models on very long inputs.
  • Links to fuzzy inference might allow the framework to produce more interpretable decision rules in classification tasks.

Load-bearing premise

The local model and localization trick together suffice to connect and generalize the listed existing models without omitting their essential behaviors.

What would settle it

A step-by-step derivation that recovers the exact standard Transformer attention equations from repeated applications of the localization trick on hierarchical local models would confirm the claim; failure to recover those equations exactly would refute it.
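
For reference, the target such a derivation would have to reach is the standard single-head scaled dot-product attention. The notation below ($X$, $W_Q$, $W_K$, $W_V$, $d_k$) is the conventional one from the Transformer literature and is an assumption here, not the paper's own. The last line rewrites each output row as a normalized, kernel-weighted mean of the value vectors; whether the paper's localization kernel takes exactly this exponential form cannot be established from the abstract.

```latex
% Standard single-head scaled dot-product attention (conventional notation,
% assumed here; not quoted from the paper).
\begin{align}
Q &= X W_Q, \quad K = X W_K, \quad V = X W_V, \\
\operatorname{Attention}(Q, K, V) &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \\
o_i &= \sum_{j} \frac{\exp\!\left(q_i^{\top} k_j / \sqrt{d_k}\right)}{\sum_{l} \exp\!\left(q_i^{\top} k_l / \sqrt{d_k}\right)}\, v_j .
\end{align}
```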

Original abstract

This paper proposes a general machine learning framework called the localization method, which is fundamentally built on two core concepts: localization kernels and local means -- key components that underpin the self-attention mechanism. To establish a rigorous theoretical foundation, the framework is formally defined through two essential pillars: the formulation of the local(-ized) model and the localization trick. We systematically investigate the connections between the localization method and a wide range of existing machine learning models/methods, including (but not limited to) kernel methods, lazy learning, the MeanShift algorithm, relaxation labeling, Hopfield networks, local linear embedding (LLE), fuzzy inference, and denoising autoencoders (DAEs). By dissecting these relationships, we clarify the broader theoretical significance of the localization method and demonstrate its practical applicability across diverse machine learning tasks. Furthermore, we explore advanced extensions of the framework, such as adaptive kernels, hierarchical local models, and non-local models. Notably, we show that the Transformer -- a cornerstone of modern sequence modeling -- can be constructed using hierarchical local models, revealing the ability of the localization method to unify and generalize state-of-the-art architectures. This work not only provides a unified theoretical lens to reinterpret existing models but also offers new methodological tools for designing flexible, data-adaptive learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the localization method as a general framework for machine learning, founded on localization kernels and local means. It establishes the framework through the local model and the localization trick, and it explores connections to kernel methods, lazy learning, MeanShift, relaxation labeling, Hopfield networks, LLE, fuzzy inference, and denoising autoencoders. It further develops extensions including adaptive kernels, hierarchical local models, and non-local models, and argues that the Transformer architecture can be derived from hierarchical local models.

Significance. Should the connections and the Transformer construction be rigorously derived without circularity or ad-hoc choices, this work could provide a valuable unifying theory for a broad range of machine learning models and architectures. The ability to reconstruct state-of-the-art models like Transformers from the localization framework would strengthen the case for its generality and offer new insights into designing adaptive learning systems.

major comments (2)
  1. [Section on hierarchical local models and Transformer construction] The construction of the Transformer using hierarchical local models must be shown to exactly reproduce the standard self-attention mechanism, including the query, key, value projections and the scaling factor. If the localization trick or kernels are defined in a way that incorporates these elements by construction, the unification claim requires clarification to avoid circularity.
  2. [Formulation of the local model and localization trick] The two pillars (local model formulation and localization trick) are presented as establishing a rigorous foundation, but it is unclear whether the definitions are independent of the models they aim to unify or if they are tailored to fit the connections. A concrete example deriving one of the listed models (e.g., Hopfield networks or MeanShift) from the core definitions would help assess if the framework is generative or descriptive.
minor comments (2)
  1. [Abstract] The abstract lists many connections but does not specify which are novel derivations versus reinterpretations; clarifying this would improve the presentation.
  2. [Notation] Ensure consistent use of notation for localization kernels and local means throughout the paper to avoid confusion with standard kernel methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address the concerns regarding the rigor of the Transformer derivation and the independence of the core definitions below, and have revised the manuscript accordingly to provide explicit derivations and examples.

point-by-point responses
  1. Referee: [Section on hierarchical local models and Transformer construction] The construction of the Transformer using hierarchical local models must be shown to exactly reproduce the standard self-attention mechanism, including the query, key, value projections and the scaling factor. If the localization trick or kernels are defined in a way that incorporates these elements by construction, the unification claim requires clarification to avoid circularity.

    Authors: We agree that an explicit, step-by-step derivation is required to substantiate the claim. In the revised manuscript we have expanded the relevant section to derive the standard self-attention mechanism directly from the hierarchical local-model construction. The query, key, and value projections arise from the specific choice of local models at each level of the hierarchy, while the scaling factor follows from the normalization inherent in the localization kernel; neither is presupposed in the general definitions. The core localization kernels and local means are introduced independently of any target architecture, and the Transformer is obtained as one particular hierarchical specialization. We have added a clarifying paragraph stating that the framework is therefore not circular. revision: yes

  2. Referee: [Formulation of the local model and localization trick] The two pillars (local model formulation and localization trick) are presented as establishing a rigorous foundation, but it is unclear whether the definitions are independent of the models they aim to unify or if they are tailored to fit the connections. A concrete example deriving one of the listed models (e.g., Hopfield networks or MeanShift) from the core definitions would help assess if the framework is generative or descriptive.

    Authors: To demonstrate that the framework is generative, the revised manuscript now contains a fully worked derivation of the MeanShift algorithm starting from the general local-model formulation and localization trick. The derivation begins with the abstract definitions of localization kernels and local means and arrives at the standard MeanShift update without additional assumptions. A shorter outline deriving the Hopfield network is also supplied. These examples illustrate that the two pillars are formulated from first principles of localization and are not tailored to the models they later recover. revision: yes
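
For orientation, the standard MeanShift update that the second response refers to can be sketched as follows: the current point is repeatedly replaced by a kernel-weighted mean of the data until it stops moving. This is a minimal sketch of the textbook algorithm with a Gaussian kernel, not the derivation from the revised manuscript; the function names, the bandwidth, the tolerance, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import numpy as np

def mean_shift_step(x, points, bandwidth):
    """One MeanShift update: move x to the Gaussian-weighted mean of `points`."""
    sq_dist = np.sum((points - x) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))  # assumed Gaussian kernel
    return weights @ points / weights.sum()

def mean_shift(x0, points, bandwidth=1.0, tol=1e-6, max_iter=500):
    """Iterate the update until the shift is smaller than `tol`."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = mean_shift_step(x, points, bandwidth)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Synthetic data (illustrative only): two well-separated 2-D clusters.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([cluster_a, cluster_b])

mode_a = mean_shift([0.5, 0.5], points, bandwidth=1.0)  # climbs toward the first cluster
mode_b = mean_shift([4.5, 4.5], points, bandwidth=1.0)  # climbs toward the second cluster
```

Each step is a normalized kernel average centered on the current estimate, which is why MeanShift is a natural first candidate for a worked derivation from a local-mean formulation.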

Circularity Check

0 steps flagged

No significant circularity; framework definitions enable explicit mappings to prior models without reducing to inputs by construction.

full rationale

The paper introduces the localization method via two pillars (local model formulation and localization trick) that are defined independently of the specific models they later connect to. It then derives connections to kernel methods, MeanShift, Hopfield networks, LLE, and others through explicit reformulations, and constructs the Transformer as a special case of hierarchical local models. These steps rely on general definitions of kernels and local means rather than fitting parameters to target outputs or invoking self-citations for uniqueness. No equation or claim reduces to a tautology or renamed input; the unification follows from applying the core concepts to each architecture. The derivation remains self-contained against external benchmarks like standard self-attention equations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Limited to abstract content; the paper introduces new core concepts without mentioning fitted parameters or external benchmarks. The definitional pillars serve as the primary axioms, and localization kernels/local means function as invented entities central to the claim.

axioms (1)
  • domain assumption: The localization method is formally defined through the formulation of the local(-ized) model and the localization trick.
    Stated in the abstract as the two essential pillars establishing the rigorous theoretical foundation.
invented entities (2)
  • localization kernels (no independent evidence)
    purpose: Core component that underpins the self-attention mechanism and the localization method.
    Introduced as one of the two fundamental building blocks of the framework.
  • local means (no independent evidence)
    purpose: Key component that underpins the self-attention mechanism and the localization method.
    Introduced as one of the two fundamental building blocks of the framework.

pith-pipeline@v0.9.0 · 5749 in / 1574 out tokens · 50333 ms · 2026-05-21T06:30:09.535066+00:00 · methodology


