The General Theory of Localization Methods

Congwei Song

arxiv: 2605.20635 · v1 · pith:QERS5T3Unew · submitted 2026-05-20 · 💻 cs.LG · math.ST · stat.ML · stat.TH

Pith reviewed 2026-05-21 06:30 UTC · model grok-4.3

classification 💻 cs.LG · math.ST · stat.ML · stat.TH

keywords localization method · self-attention · Transformer · kernel methods · Hopfield networks · unified framework · machine learning models · hierarchical models

The pith

The localization method unifies many machine learning models and reconstructs the Transformer from hierarchical local models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the localization method as a general machine learning framework built on localization kernels and local means, the two components that underpin self-attention. It defines the framework through two pillars, the formulation of the local model and the localization trick, and uses them to connect it rigorously to kernel methods, lazy learning, MeanShift, relaxation labeling, Hopfield networks, local linear embedding, fuzzy inference, and denoising autoencoders. The central result shows that Transformers can be assembled from hierarchical local models, which positions the method both as a way to reinterpret existing techniques and as a source of new adaptive architectures.

Core claim

Defining the local model and applying the localization trick yields a framework that reinterprets kernel methods, lazy learning, MeanShift, relaxation labeling, Hopfield networks, local linear embedding, fuzzy inference, and denoising autoencoders as instances of localization while extending the same structure to adaptive kernels, hierarchical local models, and the construction of Transformers.
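To make the local model concrete, here is a minimal sketch that uses the local loss quoted on this page, J(x*,θ) := ∑ K(x*,xi) l(xi,θ). The Gaussian kernel, the squared loss, the linear model, and every function name are illustrative assumptions, not the paper's choices; under those assumptions the local model reduces to ordinary locally weighted least squares.

```python
import numpy as np

def gaussian_kernel(x_star, X, bandwidth=1.0):
    """Assumed localization kernel K(x*, x_i): Gaussian weights decaying with distance."""
    sq_dist = np.sum((X - x_star) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def fit_local_linear(x_star, X, y, bandwidth=1.0):
    """Minimize J(x*, theta) = sum_i K(x*, x_i) * (y_i - [1, x_i] @ theta)**2."""
    w = gaussian_kernel(x_star, X, bandwidth)
    A = np.hstack([np.ones((X.shape[0], 1)), X])  # design matrix with intercept
    sw = np.sqrt(w)
    # Weighted least squares: scale each row by sqrt(w_i), then solve plain lstsq.
    theta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return theta

def predict_local_linear(x_star, X, y, bandwidth=1.0):
    """Fit a fresh local model around x* and evaluate it at x*."""
    theta = fit_local_linear(x_star, X, y, bandwidth)
    return theta[0] + x_star @ theta[1:]

# Toy data that is exactly linear, so any local fit should reproduce it.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + X @ np.array([1.0, -2.0])
x_star = np.array([0.3, -0.1])
prediction = predict_local_linear(x_star, X, y)
print(prediction)
```

Because the parameters are refit for every query point, the model is "lazy": nothing is trained ahead of time, and the kernel alone decides which samples matter near x*.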

What carries the argument

The localization trick, which weights local means by localization kernels to produce localized models that generalize self-attention.
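A small numerical sketch of why a kernel-weighted local mean can be read as a generalization of self-attention. It assumes an exponential dot-product kernel, K(q, k_i) = exp(q·k_i / √d), which is an illustrative choice and not quoted from the paper; with that kernel the local mean ∑ K y_i / ∑ K coincides with single-query softmax attention. All names are hypothetical.

```python
import numpy as np

def local_mean(weights, Y):
    """Local mean: sum_i K_i * y_i / sum_i K_i for precomputed kernel weights K_i."""
    return (weights @ Y) / weights.sum()

def exp_dot_kernel(query, keys):
    """Assumed kernel K(q, k_i) = exp(q . k_i / sqrt(d)); not taken from the paper."""
    d = keys.shape[1]
    return np.exp(keys @ query / np.sqrt(d))

def softmax_attention(query, keys, values):
    """Single-query scaled dot-product attention, written the usual way."""
    scores = keys @ query / np.sqrt(keys.shape[1])
    p = np.exp(scores - scores.max())  # shift for numerical stability
    p /= p.sum()
    return p @ values

rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 4))
values = rng.normal(size=(6, 3))
query = rng.normal(size=4)

via_local_mean = local_mean(exp_dot_kernel(query, keys), values)
via_attention = softmax_attention(query, keys, values)
print(np.allclose(via_local_mean, via_attention))
```

Swapping `exp_dot_kernel` for any other non-negative kernel leaves `local_mean` unchanged, which is the sense in which the kernel choice, rather than the averaging step, distinguishes one localized model from another.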

If this is right

Hopfield networks and denoising autoencoders become special cases of localized models.
Transformers arise directly from stacking hierarchical local models.
Adaptive kernels and non-local extensions fit inside the same formal structure.
New data-adaptive systems can be designed by varying the choice of localization kernel.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same local-means construction might extend naturally to graph-structured or spatial data beyond sequences.
Focusing computation on local kernels could suggest efficiency gains for attention-based models on very long inputs.
Links to fuzzy inference might allow the framework to produce more interpretable decision rules in classification tasks.

Load-bearing premise

The local model and localization trick together suffice to connect and generalize the listed existing models without omitting their essential behaviors.

What would settle it

A step-by-step derivation that recovers the exact standard Transformer attention equations from repeated applications of the localization trick on hierarchical local models would confirm the claim; failure to recover those equations exactly would refute it.
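As a hedged illustration of what such a derivation would have to reach, the identity below starts from a local mean with an assumed exponential kernel and arrives at scaled dot-product attention. The kernel form, the projections W_Q, W_K, W_V, and the scale √d_k are inserted by hand here; whether the paper derives them from its two pillars or takes them as inputs is exactly what this page cannot confirm.

```latex
% Assumed kernel (illustrative, not quoted from the paper):
%   K(q_t, k_s) = \exp\!\left( q_t^\top k_s / \sqrt{d_k} \right)
% Assumed inputs: q_t = W_Q x_t, \; k_s = W_K x_s, \; v_s = W_V x_s
\begin{align*}
\hat{y}_t
  &= \frac{\sum_{s} K(q_t, k_s)\, v_s}{\sum_{s'} K(q_t, k_{s'})}
   = \sum_{s} \frac{\exp\!\left(q_t^\top k_s / \sqrt{d_k}\right)}
                   {\sum_{s'} \exp\!\left(q_t^\top k_{s'} / \sqrt{d_k}\right)}\, v_s \\
  &= \sum_{s} \operatorname{softmax}_s\!\left(\frac{q_t^\top k_s}{\sqrt{d_k}}\right) v_s .
\end{align*}
% Stacking the rows \hat{y}_t over all positions t gives the matrix form
%   \hat{Y} = \operatorname{softmax}\!\left( Q K^\top / \sqrt{d_k} \right) V .
```

The algebra itself is immediate once the kernel is fixed, so the open question is the origin of the hand-inserted pieces, not the final equality.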

Original abstract

This paper proposes a general machine learning framework called the localization method, which is fundamentally built on two core concepts: localization kernels and local means -- key components that underpin the self-attention mechanism. To establish a rigorous theoretical foundation, the framework is formally defined through two essential pillars: the formulation of the local(-ized) model and the localization trick. We systematically investigate the connections between the localization method and a wide range of existing machine learning models/methods, including (but not limited to) kernel methods, lazy learning, the MeanShift algorithm, relaxation labeling, Hopfield networks, local linear embedding (LLE), fuzzy inference, and denoising autoencoders (DAEs). By dissecting these relationships, we clarify the broader theoretical significance of the localization method and demonstrate its practical applicability across diverse machine learning tasks. Furthermore, we explore advanced extensions of the framework, such as adaptive kernels, hierarchical local models, and non-local models. Notably, we show that the Transformer -- a cornerstone of modern sequence modeling -- can be constructed using hierarchical local models, revealing the ability of the localization method to unify and generalize state-of-the-art architectures. This work not only provides a unified theoretical lens to reinterpret existing models but also offers new methodological tools for designing flexible, data-adaptive learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper sketches a unification of local and kernel methods under a new localization framework and claims Transformers arise as hierarchical local models, but the construction may need extra assumptions to match self-attention exactly.

The core claim here is that many machine learning techniques can be viewed through the lens of a localization method based on localization kernels and local means, with Transformers emerging as a hierarchical version of local models. The paper does a solid job surveying connections to established methods like kernel methods, the MeanShift algorithm, Hopfield networks, local linear embedding, and denoising autoencoders. These links are drawn clearly enough in the abstract to suggest the framework has some organizing power over that family of approaches.

What stands out as new is the explicit construction of Transformers from hierarchical local models using the localization trick. This goes beyond just listing similarities and tries to show how self-attention fits inside the general setup.

The main soft spot is whether that construction is exact and general. The stress-test concern hits the mark: to match standard Transformer equations, including the specific projections and scaling, the framework probably needs some additional specifications on the kernels or the hierarchy levels. If those are not derived from the two pillars but added ad hoc, the unification loses some of its force. Since the review is based on the abstract, there's no way to check the actual math or see if there are proofs or examples that make the claims stick. The soundness is limited by that.

This paper is aimed at theorists in machine learning who work on attention mechanisms or local nonparametric methods. A reader interested in unifying frameworks could find value in the synthesis, provided the details hold up. It deserves a serious referee. The topic is relevant and the attempt at unification is substantive enough to warrant feedback on the technical parts.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the localization method as a general framework for machine learning, founded on localization kernels and local means. It establishes the framework through the local model and the localization trick, and explores connections to kernel methods, lazy learning, MeanShift, relaxation labeling, Hopfield networks, LLE, fuzzy inference, and denoising autoencoders. It further develops extensions including adaptive kernels, hierarchical local models, and non-local models, and demonstrates that the Transformer architecture can be derived from hierarchical local models.

Significance. Should the connections and the Transformer construction be rigorously derived without circularity or ad-hoc choices, this work could provide a valuable unifying theory for a broad range of machine learning models and architectures. The ability to reconstruct state-of-the-art models like Transformers from the localization framework would strengthen the case for its generality and offer new insights into designing adaptive learning systems.

major comments (2)

[Section on hierarchical local models and Transformer construction] The construction of the Transformer using hierarchical local models must be shown to exactly reproduce the standard self-attention mechanism, including the query, key, value projections and the scaling factor. If the localization trick or kernels are defined in a way that incorporates these elements by construction, the unification claim requires clarification to avoid circularity.
[Formulation of the local model and localization trick] The two pillars (local model formulation and localization trick) are presented as establishing a rigorous foundation, but it is unclear whether the definitions are independent of the models they aim to unify or if they are tailored to fit the connections. A concrete example deriving one of the listed models (e.g., Hopfield networks or MeanShift) from the core definitions would help assess if the framework is generative or descriptive.

minor comments (2)

[Abstract] The abstract lists many connections but does not specify which are novel derivations versus reinterpretations; clarifying this would improve the presentation.
[Notation] Ensure consistent use of notation for localization kernels and local means throughout the paper to avoid confusion with standard kernel methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address the concerns regarding the rigor of the Transformer derivation and the independence of the core definitions below, and have revised the manuscript accordingly to provide explicit derivations and examples.

Referee: [Section on hierarchical local models and Transformer construction] The construction of the Transformer using hierarchical local models must be shown to exactly reproduce the standard self-attention mechanism, including the query, key, value projections and the scaling factor. If the localization trick or kernels are defined in a way that incorporates these elements by construction, the unification claim requires clarification to avoid circularity.

Authors: We agree that an explicit, step-by-step derivation is required to substantiate the claim. In the revised manuscript we have expanded the relevant section to derive the standard self-attention mechanism directly from the hierarchical local-model construction. The query, key, and value projections arise from the specific choice of local models at each level of the hierarchy, while the scaling factor follows from the normalization inherent in the localization kernel; neither is presupposed in the general definitions. The core localization kernels and local means are introduced independently of any target architecture, and the Transformer is obtained as one particular hierarchical specialization. We have added a clarifying paragraph stating that the framework is therefore not circular.
revision: yes

Referee: [Formulation of the local model and localization trick] The two pillars (local model formulation and localization trick) are presented as establishing a rigorous foundation, but it is unclear whether the definitions are independent of the models they aim to unify or if they are tailored to fit the connections. A concrete example deriving one of the listed models (e.g., Hopfield networks or MeanShift) from the core definitions would help assess if the framework is generative or descriptive.

Authors: To demonstrate that the framework is generative, the revised manuscript now contains a fully worked derivation of the MeanShift algorithm starting from the general local-model formulation and localization trick. The derivation begins with the abstract definitions of localization kernels and local means and arrives at the standard MeanShift update without additional assumptions. A shorter outline deriving the Hopfield network is also supplied. These examples illustrate that the two pillars are formulated from first principles of localization and are not tailored to the models they later recover.
revision: yes
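For orientation, the standard MeanShift update can itself be written as a local mean in which the averaged values are the sample points. The sketch below shows that reading under an assumed Gaussian kernel; it is not the derivation the simulated rebuttal refers to, which this page does not reproduce, and the function names and toy data are hypothetical.

```python
import numpy as np

def mean_shift_step(x, X, bandwidth=1.0):
    """One update: x <- sum_i K(x, x_i) x_i / sum_i K(x, x_i), Gaussian K assumed."""
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    return (w @ X) / w.sum()

def mean_shift(x0, X, bandwidth=1.0, tol=1e-8, max_iter=500):
    """Iterate the local-mean update until the point stops moving."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = mean_shift_step(x, X, bandwidth)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Two tight, well-separated toy clusters around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
X = np.vstack([cluster_a, cluster_b])

mode = mean_shift(np.array([0.5, 0.5]), X)
print(mode)
```

Starting near the first cluster, the iteration should settle close to that cluster's centre, since the kernel weights of the distant cluster are negligible at this bandwidth.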

Circularity Check

0 steps flagged

No significant circularity; framework definitions enable explicit mappings to prior models without reducing to inputs by construction.

full rationale

The paper introduces the localization method via two pillars (local model formulation and localization trick) that are defined independently of the specific models they later connect to. It then derives connections to kernel methods, MeanShift, Hopfield networks, LLE, and others through explicit reformulations, and constructs the Transformer as a special case of hierarchical local models. These steps rely on general definitions of kernels and local means rather than fitting parameters to target outputs or invoking self-citations for uniqueness. No equation or claim reduces to a tautology or renamed input; the unification follows from applying the core concepts to each architecture. The derivation remains self-contained against external benchmarks like standard self-attention equations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Limited to abstract content; the paper introduces new core concepts without mentioning fitted parameters or external benchmarks. The definitional pillars serve as the primary axioms, and localization kernels/local means function as invented entities central to the claim.

axioms (1)

domain assumption: The localization method is formally defined through the formulation of the local(-ized) model and the localization trick.
Stated in the abstract as the two essential pillars establishing the rigorous theoretical foundation.

invented entities (2)

localization kernels (no independent evidence)
purpose: Core component that underpins the self-attention mechanism and the localization method.
Introduced as one of the two fundamental building blocks of the framework.
local means (no independent evidence)
purpose: Key component that underpins the self-attention mechanism and the localization method.
Introduced as one of the two fundamental building blocks of the framework.

pith-pipeline@v0.9.0 · 5749 in / 1574 out tokens · 50333 ms · 2026-05-21T06:30:09.535066+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

local loss $J(x^*,\theta) := \sum_i K(x^*,x_i)\, l(x_i,\theta)$; local mean $\hat{y}(x^*) = \sum_i K(x^*,x_i)\, y_i / \sum_i K(x^*,x_i)$; temporal kernel for self-attention $K(x_t,t,x_s,s)$
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Relation between the paper passage and the cited Recognition theorem.

hierarchical local models construct Transformer via localization trick on simple models

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

151 extracted references · 151 canonical work pages · 5 internal anchors

[1]

Saliency detection using maximum symmetric surround

Radhakrishna Achanta and Sabine S¨ usstrunk. Saliency detection using maximum symmetric surround. In2010 IEEE International Conference on Image Processing, pages 2653–2656. IEEE, 2010

work page 2010
[2]

Instance-based learning algorithms.Machine learning, 6:37–66, 1991

David W Aha, Dennis Kibler, and Marc K Albert. Instance-based learning algorithms.Machine learning, 6:37–66, 1991

work page 1991
[3]

Ainslie, J

J. Ainslie, J. Lee Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebr´ on, and S. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023

work page 2023
[4]

Gsns: generative stochastic networks.Information and Inference: A Journal of the IMA, 5(2):210– 249, 2016

Guillaume Alain, Yoshua Bengio, Li Yao, Jason Yosinski, Eric Thibodeau- Laufer, Saizheng Zhang, and Pascal Vincent. Gsns: generative stochastic networks.Information and Inference: A Journal of the IMA, 5(2):210– 249, 2016

work page 2016
[5]

Hebbian learning from first principles.Journal of Mathe- matical Physics, 65(11):113302, 2024

Linda Albanese, Adriano Barra, Pierluigi Bianco, Fabrizio Durante, and Diego Pallara. Hebbian learning from first principles.Journal of Mathe- matical Physics, 65(11):113302, 2024

work page 2024
[6]

The mean shift algorithm and its relation to kernel regression.Information Sciences, 348:198–208, 2016

Youness Aliyari Ghassabeh and Frank Rudzicz. The mean shift algorithm and its relation to kernel regression.Information Sciences, 348:198–208, 2016

work page 2016
[7]

Amid and M

Ehsan Amid and Manfred K Warmuth. Trimap: Large-scale dimension- ality reduction using triplets.arXiv preprint arXiv:1910.00204, 2019

work page arXiv 1910
[8]

Springer, 2006

Gilles Aubert, Pierre Kornprobst, and Giles Aubert.Mathematical prob- lems in image processing: partial differential equations and the calculus of variations, volume 147. Springer, 2006

work page 2006
[9]

Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models

Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. 2022

work page 2022
[10]

Laplacian eigenmaps and spectral techniques for embedding and clustering

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. Dietterich, S. Becker, and Z. Ghahramani, editors,Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001

work page 2001
[11]

P. K. Bhattacharya. Estimation of a probability density function and its derivatives.Sankhy¯ a: The Indian Journal of Statistics, Series A (1961- 2002), 29(4):373–382, 1967

work page 1961
[12]

Lazy learning for local modelling and control design.International Journal of Control, 72(7-8):643–658, 1999

Gianluca Bontempi, Mauro Birattari, and Hugues Bersini. Lazy learning for local modelling and control design.International Journal of Control, 72(7-8):643–658, 1999. 64

work page 1999
[13]

Nonparametric density estimation via diffusion mixing

Zdravko Botev. Nonparametric density estimation via diffusion mixing. Technical report, University of Queensland, 2007

work page 2007
[14]

Breunig, Hans-Peter Kriegel, Raymond T

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and J¨ org Sander. Lof: identifying density-based local outliers. InProceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD ’00, page 93–104, New York, NY, USA, 2000. Association for Computing Machinery

work page 2000
[15]

A review of image denoising algorithms, with a new one.Multiscale Model

Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A review of image denoising algorithms, with a new one.Multiscale Model. Simul., 4:490– 530, 2005

work page 2005
[16]

Neighborhood filters and pde’s.Numerische Mathematik, 105:1–34, 2006

Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Neighborhood filters and pde’s.Numerische Mathematik, 105:1–34, 2006

work page 2006
[17]

Vici- nal risk minimization.Advances in neural information processing systems, 13, 2000

Olivier Chapelle, Jason Weston, L´ eon Bottou, and Vladimir Vapnik. Vici- nal risk minimization.Advances in neural information processing systems, 13, 2000

work page 2000
[18]

Local multidimensional scaling for nonlin- ear dimension reduction, graph drawing, and proximity analysis.Journal of the American Statistical Association, 104:209 – 219, 2009

Lisha Chen and Andreas Buja. Local multidimensional scaling for nonlin- ear dimension reduction, graph drawing, and proximity analysis.Journal of the American Statistical Association, 104:209 – 219, 2009

work page 2009
[19]

A tutorial on kernel density estimation and recent ad- vances.Biostatistics & Epidemiology, 1(1):161–187, 2017

Yen-Chi Chen. A tutorial on kernel density estimation and recent ad- vances.Biostatistics & Epidemiology, 1(1):161–187, 2017

work page 2017
[20]

Efficient algorithm for localized support vector machine.IEEE Transactions on Knowledge and Data Engineering, 22(4):537–549, 2009

Haibin Cheng, Pang-Ning Tan, and Rong Jin. Efficient algorithm for localized support vector machine.IEEE Transactions on Knowledge and Data Engineering, 22(4):537–549, 2009

work page 2009
[21]

Mean shift, mode seeking, and clustering.IEEE transac- tions on pattern analysis and machine intelligence, 17(8):790–799, 1995

Yizong Cheng. Mean shift, mode seeking, and clustering.IEEE transac- tions on pattern analysis and machine intelligence, 17(8):790–799, 1995

work page 1995
[22]

MIT Press, Cambridge, MA, 1965

Noam Chomsky.Aspects of the Theory of Syntax. MIT Press, Cambridge, MA, 1965

work page 1965
[23]

MIT Press, Cambridge, MA, 1995

Noam Chomsky.The Minimalist Program. MIT Press, Cambridge, MA, 1995

work page 1995
[24]

Cleveland

William S. Cleveland. Robust locally weighted regression and smoothing scatterplots.Journal of the American Statistical Association, 74:829–836, 1979

work page 1979
[25]

Cleveland and Susan J

William S. Cleveland and Susan J. Devlin. Locally-weighted regression: an approach to regression analysis by local fitting.Journal of the American Statistical Association, 83:596–610, 1988

work page 1988
[26]

Comaniciu, V

D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object track- ing.IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564–577, 2003. 65

work page 2003
[27]

Mean shift analysis and applications

Dorin Comaniciu and Peter Meer. Mean shift analysis and applications. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1197–1203. IEEE, 1999

work page 1999
[28]

Mean shift: A robust approach to- ward feature space analysis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002

Dorin Comaniciu and Peter Meer. Mean shift: A robust approach to- ward feature space analysis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002

work page 2002
[29]

Nearest neighbor pattern classification

Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21–27, 1967

work page 1967
[30]

Transformer-xl: Attentive language models beyond a fixed-length context

Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Car- bonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. 2019

work page 2019
[31]

On conditional density estimation

Jan G De Gooijer and Dawit Zerom. On conditional density estimation. Statistica Neerlandica, 57(2):159–176, 2003

work page 2003
[32]

The road from mle to em to vae: A brief tutorial.AI Open, 2022(3):29–34, 2022

Ming Ding. The road from mle to em to vae: A brief tutorial.AI Open, 2022(3):29–34, 2022

work page 2022
[33]

Random graph modeling: A survey of the concepts.ACM computing surveys (CSUR), 52(6):1–36, 2019

Mikhail Drobyshevskiy and Denis Turdakov. Random graph modeling: A survey of the concepts.ACM computing surveys (CSUR), 52(6):1–36, 2019

work page 2019
[34]

Tweedie’s formula and selection bias.Journal of the Amer- ican Statistical Association, 106(496):1602–1614, 2011

Bradley Efron. Tweedie’s formula and selection bias.Journal of the Amer- ican Statistical Association, 106(496):1602–1614, 2011

work page 2011
[35]

Shinto Eguchi, Tae Yoon Kim, and Byeong U. Park. Local likelihood method: A bridge over parametric and nonparametric regression.Non- parametric Statistics, 15(6):665–683, 2003

work page 2003
[36]

Ezekiel.Methods of Correlation Analysis

M. Ezekiel.Methods of Correlation Analysis. John Wiley & Sons, New York, 2nd edition, 1941

work page 1941
[37]

Fan and I

J. Fan and I. Gijbels. Variable bandwidth and local linear regression smoothers.Annals Stat., 20:2008–2036, 1992

work page 2008
[38]

Local maximum likelihood estimation and inference.Journal of the American Statistical Association, 60:591–608, 1998

Jianqing Fan, Mark Farmen, and Irene Gijbels. Local maximum likelihood estimation and inference.Journal of the American Statistical Association, 60:591–608, 1998

work page 1998
[39]

Chapman and Hall, London, 1996

Jianqing Fan and Ir` ene Gijbels.Local Polynomial Modeling and its Ap- plications. Chapman and Hall, London, 1996

work page 1996
[40]

Local linear discriminant anal- ysis framework using sample neighbors.IEEE Transactions on Neural Networks, 22(7):1119–1132, 2011

Zizhu Fan, Yong Xu, and David Zhang. Local linear discriminant anal- ysis framework using sample neighbors.IEEE Transactions on Neural Networks, 22(7):1119–1132, 2011. 66

work page 2011
[41]

Evelyn Fix and Joseph L. Hodges. Discriminatory analysis. nonparametric discrimination: Consistency properties. Report, USAF School of Aviation Medicine, Randolph Field, Texas, 1951. Archived (PDF) from the original on September 26, 2020

work page 1951
[42]

Franke and G

R. Franke and G. Nielson. Smooth interpolation of large data sets of scattered data.Int. J. Numer. Methods Eng., 15:1691–1704, 1980

work page 1980
[43]

Hostetler

Keinosuke Fukunaga and Larry D. Hostetler. The estimation of the gradi- ent of a density function, with applications in pattern recognition.IEEE Trans. Inf. Theory, 21:32–40, 1975

work page 1975
[44]

Center-based nearest neighbor clas- sifier.Pattern Recognition, 40(1):346–349, 2007

Qing-Bin Gao and Zheng-Zhi Wang. Center-based nearest neighbor clas- sifier.Pattern Recognition, 40(1):346–349, 2007

work page 2007
[45]

Completely lazy learning.IEEE Transactions on Knowledge and Data Engineering, 22(9):1274–1285, 2009

Eric K Garcia, Sergey Feldman, Maya R Gupta, and Santosh Srivastava. Completely lazy learning.IEEE Transactions on Knowledge and Data Engineering, 22(9):1274–1285, 2009

work page 2009
[46]

Mul- tidimensional scaling, sammon mapping, and isomap

Benyamin Ghojogh, Mark Crowley, Fakhri Karray, and Ali Ghodsi. Mul- tidimensional scaling, sammon mapping, and isomap. InElements of Di- mensionality Reduction and Manifold Learning, pages 185–205. Springer, 2023

work page 2023
[47]

Neighbourhood components analysis.Advances in neural informa- tion processing systems, 17, 2004

Jacob Goldberger, Geoffrey E Hinton, Sam Roweis, and Russ R Salakhut- dinov. Neighbourhood components analysis.Advances in neural informa- tion processing systems, 17, 2004

work page 2004
[48]

A local mean-based k-nearest centroid neighbor classifier.The Computer Journal, 55(9):1058– 1071, 01 2011

Jianping Gou, Zhang Yi, Lan Du, and Taisong Xiong. A local mean-based k-nearest centroid neighbor classifier.The Computer Journal, 55(9):1058– 1071, 01 2011

work page 2011
[49]

Springer, 2018

Artur Gramacki.Nonparametric Kernel Density Estimation and Its Com- putational Aspects. Springer, 2018

work page 2018
[50]

Borgwardt, Malte J

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch¨ olkopf, and Alex Smola. A kernel two-sample test.J. Mach. Learn. Res., 13:723–773, 2012

work page 2012
[51]

Hastie, R

T. Hastie, R. Tibshirani, and J.H. Friedman.The Elements of Statisti- cal Learning: Data Mining, Inference, and Prediction. Springer series in statistics. Springer, 2009

work page 2009
[52]

Locality preserving projections

Xiaofei He and Partha Niyogi. Locality preserving projections. InPro- ceedings of the 16th International Conference on Neural Information Pro- cessing Systems, page 153–160, Cambridge, MA, USA, 2003. MIT Press

work page 2003
[53]

Hinton and Sam T

Geoffrey E. Hinton and Sam T. Roweis. Stochastic neighbor embedding. InNeural Information Processing Systems, 2002. 67

work page 2002
[54]

N. L. Hjort and M. C. Jones. Locally parametric nonparametric density estimation.The Annals of Statistics, 24(4):1619–1647, 1996

work page 1996
[55]

Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models.ArXiv, abs/2006.11239, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[56]

Thomas Hofmann, Bernhard Scholkopf, and Alexander J. Smola. Kernel methods in machine learning.The Annals of Statistics, 36(3):1171–1220, 2008

work page 2008
[57]

Neural networks and physical systems with emergent collective computational abilities.Proceedings of the national academy of sciences, 79(8):2554–2558, 1982

John J Hopfield. Neural networks and physical systems with emergent collective computational abilities.Proceedings of the national academy of sciences, 79(8):2554–2558, 1982

work page 1982
[58]

On the foundations of relaxation labeling processes.IEEE Transactions on Pattern Analysis and Machine Intelligence, (3):267–287, 1983

Robert A Hummel and Steven W Zucker. On the foundations of relaxation labeling processes.IEEE Transactions on Pattern Analysis and Machine Intelligence, (3):267–287, 1983

work page 1983
[59]

Estimation of non-normalized statistical models by score matching.Journal of Machine Learning Research, 6(4):695–709, 2005

Aapo Hyv¨ arinen. Estimation of non-normalized statistical models by score matching.Journal of Machine Learning Research, 6(4):695–709, 2005

work page 2005
[60]

Anfis: adaptive-network-based fuzzy inference system.IEEE transactions on systems, man, and cybernetics, 23(3):665–685, 1993

J-SR Jang. Anfis: adaptive-network-based fuzzy inference system.IEEE transactions on systems, man, and cybernetics, 23(3):665–685, 1993

work page 1993
[61]

Dimension reduction by local principal component analysis.Neural computation, 9(7):1493–1516, 1997

Nandakishore Kambhatla and Todd K Leen. Dimension reduction by local principal component analysis.Neural computation, 9(7):1493–1516, 1997

work page 1997
[62]

A review and compar- ison of bandwidth selection methods for kernel regression.International Statistical Review, 82(2):243–274, 2014

Max K¨ ohler, Anja Schindler, and Stefan Sperlich. A review and compar- ison of bandwidth selection methods for kernel regression.International Statistical Review, 82(2):243–274, 2014

work page 2014
[63]

Dense associative memory for pattern recognition.Advances in neural information processing systems, 29:1172– 1180, 2016

Dmitry Krotov and John J Hopfield. Dense associative memory for pattern recognition.Advances in neural information processing systems, 29:1172– 1180, 2016

work page 2016
[64]

Raja Kumar, P

R. Raja Kumar, P. Viswanath, and C. Shobha Bindu. Nearest neighbor classifiers: A review.International Journal of Computational Intelligence Research, 13(2):303–311, 2017

work page 2017
[65]

Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks

Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer.ArXiv, abs/1810.00825, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[66]

Xiangru Li, Zhanyi Hu, and Fuchao Wu. A note on the convergence of the mean shift. Pattern Recognition, 40(6):1756–1762, 2007.

[67]

Clive R. Loader. Local likelihood density estimation. The Annals of Statistics, 24(4):1602–1618, 1996.

[68]

Dijun Luo, Chris Ding, Heng Huang, and Tao Li. Non-negative Laplacian embedding. In 2009 Ninth IEEE International Conference on Data Mining, pages 337–346. IEEE, 2009.

[69]

Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. ArXiv, abs/1508.04025, 2015.

[70]

Siwei Lyu. Interpretation and generalization of score matching. 2012.

[71]

Frederick R. Macaulay. Introduction to “The Smoothing of Time Series”. In The Smoothing of Time Series, pages 17–30. National Bureau of Economic Research, Inc, 1931.

[72]

David J Marchette, Carey E Priebe, George W Rogers, and Jeffrey L Solka. Filtered kernel density estimation. Matrix, 1994.

[73]

Oded Maron and Andrew W Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11:193–225, 1997.

[74]

David Marr and Ellen Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B. Biological Sciences, 207(1167):187–217, 1980.

[75]

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

[76]

Franziska Meier, Philipp Hennig, and Stefan Schaal. Incremental local Gaussian regression. Advances in Neural Information Processing Systems, 27, 2014.

[77]

Jerry M Mendel. Fuzzy logic systems for engineering: a tutorial. Proceedings of the IEEE, 83(3):345–377, 1995.

[78]

Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: generalized score matching for discrete data. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc.

[79]

Fanxu Meng, Pingzhi Tang, Zengwei Yao, and Muhan Zhang. TransMLA: Multi-head latent attention is all you need, 2025.

[80]

C. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7:2651–2667, 2006.

