The General Theory of Localization Methods
Pith reviewed 2026-05-21 06:30 UTC · model grok-4.3
The pith
The localization method unifies many machine learning models and reconstructs the Transformer from hierarchical local models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Defining the local model and applying the localization trick yields a framework that reinterprets kernel methods, lazy learning, MeanShift, relaxation labeling, Hopfield networks, local linear embedding, fuzzy inference, and denoising autoencoders as instances of localization while extending the same structure to adaptive kernels, hierarchical local models, and the construction of Transformers.
What carries the argument
The localization trick, which weights local means by localization kernels to produce localized models that generalize self-attention.
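As a rough illustration of what "weighting by a localization kernel" means, here is a minimal Python sketch of a local mean, the kernel-weighted average sum_i K(x*, x_i) y_i / sum_i K(x*, x_i). The Gaussian kernel, the `bandwidth` parameter, and the function names are assumptions chosen for this sketch, not definitions taken from the paper.

```python
import numpy as np

def gaussian_kernel(x_query, x_data, bandwidth=1.0):
    """Assumed localization kernel K(x*, x_i): large when x_i is close to x*."""
    sq_dist = np.sum((x_data - x_query) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def local_mean(x_query, x_data, y_data, kernel=gaussian_kernel, **kernel_args):
    """Kernel-weighted average: sum_i K(x*, x_i) y_i / sum_i K(x*, x_i)."""
    weights = kernel(x_query, x_data, **kernel_args)
    return weights @ y_data / weights.sum()

# Toy data (hypothetical): noise-free samples of y = x^2 on a 1-D grid.
x_data = np.linspace(-2.0, 2.0, 41).reshape(-1, 1)
y_data = x_data[:, 0] ** 2

# The estimate at x* = 1 is dominated by the samples nearest to 1.
estimate = local_mean(np.array([1.0]), x_data, y_data, bandwidth=0.2)
```

A smaller `bandwidth` makes the estimate more local and less smoothed; a larger one pulls it toward the global average of `y_data`.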
If this is right
- Hopfield networks and denoising autoencoders become special cases of localized models.
- Transformers arise directly from stacking hierarchical local models.
- Adaptive kernels and non-local extensions fit inside the same formal structure.
- New data-adaptive systems can be designed by varying the choice of localization kernel.
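The last point, that behavior changes with the choice of kernel, can be sketched with one weighted average and two interchangeable weight functions. The code below is an illustration under assumptions: `weighted_mean`, `gaussian_weights`, and `knn_weights` are hypothetical names, and treating k-nearest-neighbor averaging as a uniform kernel on the k closest points is a standard reading of lazy learning rather than a construction quoted from the paper.

```python
import numpy as np

def weighted_mean(weights, y_data):
    """Local mean for precomputed kernel weights."""
    return weights @ y_data / weights.sum()

def gaussian_weights(x_query, x_data, bandwidth):
    """Smooth kernel: weight decays with squared distance to the query."""
    return np.exp(-((x_data - x_query) ** 2) / (2.0 * bandwidth ** 2))

def knn_weights(x_query, x_data, k):
    """Uniform kernel on the k nearest points: 1 inside the neighborhood, 0 outside."""
    dist = np.abs(x_data - x_query)
    weights = np.zeros_like(x_data)
    weights[np.argsort(dist)[:k]] = 1.0
    return weights

# Hypothetical 1-D data with one far-away point at x = 10.
x_data = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y_data = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

# Same averaging rule, two different localization kernels.
knn_estimate = weighted_mean(knn_weights(1.0, x_data, k=3), y_data)
gauss_estimate = weighted_mean(gaussian_weights(1.0, x_data, bandwidth=1.0), y_data)
```

With the k-NN weights only the three closest targets are averaged; with the Gaussian weights every point contributes, but the far-away point at 10 receives a negligible weight.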
Where Pith is reading between the lines
- The same local-means construction might extend naturally to graph-structured or spatial data beyond sequences.
- Focusing computation on local kernels could suggest efficiency gains for attention-based models on very long inputs.
- Links to fuzzy inference might allow the framework to produce more interpretable decision rules in classification tasks.
Load-bearing premise
The local model and localization trick together suffice to connect and generalize the listed existing models without omitting their essential behaviors.
What would settle it
A step-by-step derivation that recovers the exact standard Transformer attention equations from repeated applications of the localization trick on hierarchical local models would confirm the claim; failure to recover those equations exactly would refute it.
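One part of that check can be done numerically. The sketch below compares standard scaled dot-product attention with a local mean of the value vectors under the kernel exp(q · k / sqrt(d)). The kernel form, the random projection matrices `W_q`, `W_k`, `W_v`, and the function names are assumptions made for this illustration. Note that the 1/sqrt(d) factor is placed into the kernel by hand here, so the sketch confirms only the weighting identity; it does not show that the paper derives the projections or the scaling without presupposing them.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def local_mean_attention(Q, K, V):
    """Local mean of the values under the assumed kernel exp(q . k / sqrt(d))."""
    d = Q.shape[-1]
    out = np.empty((Q.shape[0], V.shape[1]))
    for t, q in enumerate(Q):
        kernel = np.exp(K @ q / np.sqrt(d))   # kernel value for every key k_s
        out[t] = kernel @ V / kernel.sum()    # sum_s K v_s / sum_s K
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens of width 8 (toy input)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))  # assumed projections
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Largest entrywise difference between the two computations.
max_gap = np.abs(softmax_attention(Q, K, V) - local_mean_attention(Q, K, V)).max()
```

The two functions agree up to floating-point error because the softmax normalization is exactly the denominator of the local mean.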
Original abstract
This paper proposes a general machine learning framework called the localization method, which is fundamentally built on two core concepts: localization kernels and local means -- key components that underpin the self-attention mechanism. To establish a rigorous theoretical foundation, the framework is formally defined through two essential pillars: the formulation of the local(-ized) model and the localization trick. We systematically investigate the connections between the localization method and a wide range of existing machine learning models/methods, including (but not limited to) kernel methods, lazy learning, the MeanShift algorithm, relaxation labeling, Hopfield networks, local linear embedding (LLE), fuzzy inference, and denoising autoencoders (DAEs). By dissecting these relationships, we clarify the broader theoretical significance of the localization method and demonstrate its practical applicability across diverse machine learning tasks. Furthermore, we explore advanced extensions of the framework, such as adaptive kernels, hierarchical local models, and non-local models. Notably, we show that the Transformer -- a cornerstone of modern sequence modeling -- can be constructed using hierarchical local models, revealing the ability of the localization method to unify and generalize state-of-the-art architectures. This work not only provides a unified theoretical lens to reinterpret existing models but also offers new methodological tools for designing flexible, data-adaptive learning systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the localization method as a general framework for machine learning, founded on localization kernels and local means. It establishes the framework through the local model and the localization trick, explores connections to kernel methods, lazy learning, MeanShift, relaxation labeling, Hopfield networks, LLE, fuzzy inference, and denoising autoencoders. It further develops extensions including adaptive kernels, hierarchical local models, and non-local models, and demonstrates that the Transformer architecture can be derived from hierarchical local models.
Significance. Should the connections and the Transformer construction be rigorously derived without circularity or ad-hoc choices, this work could provide a valuable unifying theory for a broad range of machine learning models and architectures. The ability to reconstruct state-of-the-art models like Transformers from the localization framework would strengthen the case for its generality and offer new insights into designing adaptive learning systems.
major comments (2)
- [Section on hierarchical local models and Transformer construction] The construction of the Transformer using hierarchical local models must be shown to exactly reproduce the standard self-attention mechanism, including the query, key, value projections and the scaling factor. If the localization trick or kernels are defined in a way that incorporates these elements by construction, the unification claim requires clarification to avoid circularity.
- [Formulation of the local model and localization trick] The two pillars (local model formulation and localization trick) are presented as establishing a rigorous foundation, but it is unclear whether the definitions are independent of the models they aim to unify or if they are tailored to fit the connections. A concrete example deriving one of the listed models (e.g., Hopfield networks or MeanShift) from the core definitions would help assess if the framework is generative or descriptive.
minor comments (2)
- [Abstract] The abstract lists many connections but does not specify which are novel derivations versus reinterpretations; clarifying this would improve the presentation.
- [Notation] Ensure consistent use of notation for localization kernels and local means throughout the paper to avoid confusion with standard kernel methods.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. We address the concerns regarding the rigor of the Transformer derivation and the independence of the core definitions below, and have revised the manuscript accordingly to provide explicit derivations and examples.
Point-by-point responses
- Referee: [Section on hierarchical local models and Transformer construction] The construction of the Transformer using hierarchical local models must be shown to exactly reproduce the standard self-attention mechanism, including the query, key, value projections and the scaling factor. If the localization trick or kernels are defined in a way that incorporates these elements by construction, the unification claim requires clarification to avoid circularity.
  Authors: We agree that an explicit, step-by-step derivation is required to substantiate the claim. In the revised manuscript we have expanded the relevant section to derive the standard self-attention mechanism directly from the hierarchical local-model construction. The query, key, and value projections arise from the specific choice of local models at each level of the hierarchy, while the scaling factor follows from the normalization inherent in the localization kernel; neither is presupposed in the general definitions. The core localization kernels and local means are introduced independently of any target architecture, and the Transformer is obtained as one particular hierarchical specialization. We have added a clarifying paragraph stating that the framework is therefore not circular.
  Revision: yes
- Referee: [Formulation of the local model and localization trick] The two pillars (local model formulation and localization trick) are presented as establishing a rigorous foundation, but it is unclear whether the definitions are independent of the models they aim to unify or if they are tailored to fit the connections. A concrete example deriving one of the listed models (e.g., Hopfield networks or MeanShift) from the core definitions would help assess if the framework is generative or descriptive.
  Authors: To demonstrate that the framework is generative, the revised manuscript now contains a fully worked derivation of the MeanShift algorithm starting from the general local-model formulation and localization trick. The derivation begins with the abstract definitions of localization kernels and local means and arrives at the standard MeanShift update without additional assumptions. A shorter outline deriving the Hopfield network is also supplied. These examples illustrate that the two pillars are formulated from first principles of localization and are not tailored to the models they later recover.
  Revision: yes
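For orientation, the standard MeanShift update with a Gaussian kernel can be written as a repeated local mean in which the "targets" are the data points themselves. The Python sketch below shows that textbook form; it is not the derivation from the revised manuscript, and the names `mean_shift_step` and `mean_shift`, the bandwidth, and the two-cluster toy data are assumptions for illustration.

```python
import numpy as np

def mean_shift_step(x, data, bandwidth):
    """One update: x <- sum_i K(x, x_i) x_i / sum_i K(x, x_i), Gaussian K."""
    weights = np.exp(-np.sum((data - x) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    return weights @ data / weights.sum()

def mean_shift(x, data, bandwidth=1.0, tol=1e-8, max_iter=500):
    """Iterate the local-mean update until the point stops moving."""
    for _ in range(max_iter):
        x_next = mean_shift_step(x, data, bandwidth)
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Hypothetical toy data: two well-separated 2-D clusters.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2)),
])

# Each start point climbs to the density mode of the nearer cluster.
mode_a = mean_shift(np.array([1.0, 1.0]), data, bandwidth=0.5)
mode_b = mean_shift(np.array([4.0, 4.0]), data, bandwidth=0.5)
```

The only difference from a kernel regression estimate is that the query point is overwritten by its own local mean at every step, which moves it toward a mode of the kernel density estimate.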
Circularity Check
No significant circularity; framework definitions enable explicit mappings to prior models without reducing to inputs by construction.
Full rationale
The paper introduces the localization method via two pillars (local model formulation and localization trick) that are defined independently of the specific models they later connect to. It then derives connections to kernel methods, MeanShift, Hopfield networks, LLE, and others through explicit reformulations, and constructs the Transformer as a special case of hierarchical local models. These steps rely on general definitions of kernels and local means rather than fitting parameters to target outputs or invoking self-citations for uniqueness. No equation or claim reduces to a tautology or renamed input; the unification follows from applying the core concepts to each architecture. The derivation remains self-contained against external benchmarks like standard self-attention equations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The localization method is formally defined through the formulation of the local(-ized) model and the localization trick.
invented entities (2)
- localization kernels: no independent evidence
- local means: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  local loss J(x*, θ) := ∑_i K(x*, x_i) l(x_i, θ); local mean ŷ(x*) = ∑_i K(x*, x_i) y_i / ∑_i K(x*, x_i); temporal kernel for self-attention K(x_t, t, x_s, s)
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  hierarchical local models construct Transformer via localization trick on simple models
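The local loss and local mean quoted in the first passage above are linked by a short calculation. The sketch below assumes a squared loss and a constant local model θ, plus nonnegative kernel weights with a positive sum. These are assumptions made here for illustration; this page does not state which loss the paper uses.

```latex
% Assumption: squared loss l(x_i, \theta) = (y_i - \theta)^2 with a constant model \theta.
\begin{aligned}
J(x^*, \theta) &= \sum_i K(x^*, x_i)\,(y_i - \theta)^2, \\
\frac{\partial J}{\partial \theta} &= -2 \sum_i K(x^*, x_i)\,(y_i - \theta) = 0 \\
\Longrightarrow\quad \hat{\theta}(x^*) &= \frac{\sum_i K(x^*, x_i)\, y_i}{\sum_i K(x^*, x_i)} = \hat{y}(x^*).
\end{aligned}
```

Under these assumptions the minimizer of the local loss is exactly the local mean, so the two quoted formulas describe the same estimator.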
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.