pith. sign in

arxiv: 2407.12208 · v2 · submitted 2024-07-16 · 🧮 math.NA · cs.NA

Computing k-means in mixed precision

Pith reviewed 2026-05-23 22:31 UTC · model grok-4.3

classification 🧮 math.NA cs.NA
keywords k-meansmixed precisionnumerical stabilityLloyd's algorithmdistance computationdata clusteringimage segmentation
0
0 comments X

The pith

Lloyd's k-means remains stable when distance computations drop to lower precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the standard distance formula used inside Lloyd's algorithm is numerically stable, then builds a mixed-precision framework that computes distances in reduced precision while keeping other steps in higher precision. Simulations on clustering tasks and image segmentation show that the approach produces results comparable to full precision, with normalized data tolerating the precision drop more readily than unnormalized data. A sympathetic reader would care because modern hardware offers fast low-precision units that could cut runtime and energy use for one of the most common clustering methods. The work therefore supplies concrete evidence that mixed-precision kernels can be swapped into k-means without immediate loss of correctness.

Core claim

We confirm the stability of the widely used distance computation formula. We propose a mixed-precision framework for k-means computation and investigate the effects of low-precision distance computation within the framework. Through extensive simulations on various data clustering and image segmentation tasks, we verify the applicability and robustness of the mixed precision k-means method. We find that, in k-means computation, normalized data is more tolerant to the reduction of precision in the distance computation, while for unnormalized data more care is needed in the use of reduced precision, mainly to avoid overflow.

What carries the argument

Mixed-precision framework that performs distance calculations in low precision inside Lloyd's iteration while retaining higher precision for other operations.

If this is right

  • Normalized input data tolerates lower precision in distance kernels with little change in clustering quality.
  • Unnormalized data requires safeguards against overflow when reduced precision is used for distances.
  • The mixed-precision distance kernel can be reused to accelerate other distance-based machine-learning routines.
  • Hardware support for low-precision arithmetic offers a practical route to faster k-means execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the stability holds beyond the tested cases, then k-means implementations on GPUs or AI accelerators could default to mixed-precision distances for an immediate performance gain.
  • The same distance-kernel substitution might be tested on streaming or online variants of k-means where data arrives continuously.
  • A natural next measurement is the exact speedup and energy saving on specific mixed-precision hardware when the framework is implemented in a production library.

Load-bearing premise

The chosen simulation datasets and tasks represent the numerical behavior that will appear in other k-means workloads.

What would settle it

A dataset and cluster count where switching only the distance kernel to low precision produces measurably different final assignments or higher within-cluster sum of squares than the full-precision run.

Figures

Figures reproduced from arXiv: 2407.12208 by Erin Carson, Xiaobo Liu, Xinye Chen.

Figure 5.1
Figure 5.1. Figure 5.1: The difference measured as (5.6) of the two distance computing formulae (4.3) and (4.4) in double precision. and so the only worrying case implied from the bound is when (x − y) T (x − y) = x T x − 2x T y + y T y = d 2 r ≪ ∥y∥ 2 2 , ∥x∥2 ≈ ∥y∥2, which can happen only when the angle θ between x and y is close to 0 (the two vectors are aligned). However, we know from (5.4) that the forward error in computi… view at source ↗
Figure 6.1
Figure 6.1. Figure 6.1: The triggered rate η of low precision computations in terms of δ on synthetic Gaussian blobs data of different deviation σ, where 2,000 Gaussian data points with 10 clusters (blobs) are generated. 6.2. Simulations with various δ To gauge how many computations have been performed in a lower precision in computing the distance between a center and data points via Algorithm 6.2, we define and report when ne… view at source ↗
Figure 6.2
Figure 6.2. Figure 6.2: The performance in terms of ARI and AMI of Algorithm [PITH_FULL_IMAGE:figures/full_fig_p012_6_2.png] view at source ↗
Figure 7.1
Figure 7.1. Figure 7.1: Visualization of the S-sets (colors marked as ground truth clusters). [PITH_FULL_IMAGE:figures/full_fig_p014_7_1.png] view at source ↗
Figure 7.2
Figure 7.2. Figure 7.2: Tasks in images selected from ImageNet [PITH_FULL_IMAGE:figures/full_fig_p018_7_2.png] view at source ↗
Figure 7.3
Figure 7.3. Figure 7.3: Results of image segmentation for k-means++ in the working precision (fp64). phenomenon in the context of Euclidean distance computation is desired. Another future research direction could be providing stronger theoretical support for the use of mixed precision in distance computation and developing a more systematic approach of low precision switching mechanism. Acknowledgements The first and second aut… view at source ↗
Figure 7.4
Figure 7.4. Figure 7.4: Results of image segmentation I (fp16). [2] H. Jegou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Transactions on Pattern Analysis and Machine ´ Intelligence 33 (1) (2010) 117–128. doi:10.1109/TPAMI.2010.57. [3] M. Norouzi, D. J. Fleet, Cartesian K-means, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2013, pp. 3017–3024. doi:10.1109/CVPR.2013.3… view at source ↗
Figure 7.5
Figure 7.5. Figure 7.5: Results of image segmentation I (q52). Proceedings of the 30th International Conference on Neural Information Processing Systems, Curran Associates Inc., New York, USA, 2016, pp. 1360–1368. doi:10.5555/3157096.3157248. [19] D. Sculley, Web-scale k-means clustering, in: WWW’10: Proceedings of the 19th international conference on World wide web, ACM Press, New York, USA, 2010, pp. 1177–1178. doi:10.1145/17… view at source ↗
Figure 7.6
Figure 7.6. Figure 7.6: Results of image segmentation II (fp16). [PITH_FULL_IMAGE:figures/full_fig_p021_7_6.png] view at source ↗
Figure 7.7
Figure 7.7. Figure 7.7: Results of image segmentation II (q52). URL https://dl.acm.org/doi/abs/10.5555/3524938.3525913 [48] J. Xu, X. Sun, Z. Zhang, G. Zhao, J. Lin, Understanding and improving layer normalization, in: H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alche-Buc, E. B. Fox (Eds.), NIPS’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, ´ Curran Associates Inc., 2019, … view at source ↗
read the original abstract

The k-means algorithm is one of the most popular and critical techniques in data mining and machine learning, and it has achieved significant success in numerous science and engineering domains. Computing k-means to a global optimum is NP-hard in Euclidean space, yet there are a variety of efficient heuristic algorithms, such as Lloyd's algorithm, that converge to a local optimum with superpolynomial complexity in the worst case. Motivated by the emergence and prominence of mixed precision capabilities in hardware, a current trend is to develop low and mixed precision variants of algorithms in order to improve the runtime and energy consumption. In this paper we study the numerical stability of Lloyd's k-means algorithm, and, in particular, we confirm the stability of the widely used distance computation formula. We propose a mixed-precision framework for k-means computation and investigate the effects of low-precision distance computation within the framework. Through extensive simulations on various data clustering and image segmentation tasks, we verify the applicability and robustness of the mixed precision k-means method. We find that, in k-means computation, normalized data is more tolerant to the reduction of precision in the distance computation, while for unnormalized data more care is needed in the use of reduced precision, mainly to avoid overflow. Our study demonstrates the potential for the use of mixed precision distance kernels to accelerate the k-means computation and offers insights into other distance-based machine learning methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that the widely used distance computation formula in Lloyd's k-means is numerically stable, proposes a mixed-precision framework for k-means, and verifies the applicability and robustness of low-precision distance computation within this framework via extensive simulations on data clustering and image segmentation tasks. It concludes that normalized data tolerates reduced precision better than unnormalized data, which requires care mainly to avoid overflow, and highlights potential for accelerating k-means with mixed-precision kernels.

Significance. If the empirical results hold, the work provides concrete evidence that mixed-precision distance kernels can be used to accelerate k-means while preserving stability on the tested workloads, offering practical insights for other distance-based ML algorithms. The explicit distinction between normalized and unnormalized data behaviors, together with the empirical validation against external datasets, strengthens the contribution in the context of emerging mixed-precision hardware.

major comments (2)
  1. [Simulation and results sections] The central stability and robustness claims rest on the simulation results, yet the manuscript summarizes (rather than fully specifies) the exact precision formats, datasets, overflow handling, and error-bar analysis; this limits the ability to assess reproducibility and generality of the reported outcomes.
  2. [Discussion and conclusions] No analytic bounds or conditions are derived on data norms relative to low-precision exponent ranges or on accumulation error in the ||x||² + ||y||² – 2<x,y> formula under varying k or dimension; while the paper is empirical, the absence of such guidance leaves open the possibility of divergence outside the tested regimes.
minor comments (1)
  1. [Abstract and results] The abstract states that 'normalized data is more tolerant' but the corresponding quantitative comparison (e.g., failure rates or assignment differences) should be highlighted with a dedicated table or figure reference in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comments on reproducibility and guidance. We address each major comment below.

read point-by-point responses
  1. Referee: [Simulation and results sections] The central stability and robustness claims rest on the simulation results, yet the manuscript summarizes (rather than fully specifies) the exact precision formats, datasets, overflow handling, and error-bar analysis; this limits the ability to assess reproducibility and generality of the reported outcomes.

    Authors: We agree that additional detail is required. In the revised manuscript we will add a new subsection that explicitly lists all precision formats employed (including FP32, FP16, and BF16), provides the full set of datasets with sources, dimensions, and preprocessing (normalization or lack thereof), describes the overflow-handling strategy (dynamic scaling when the exponent range is approached), and reports error bars obtained from multiple independent runs together with standard deviations. revision: yes

  2. Referee: [Discussion and conclusions] No analytic bounds or conditions are derived on data norms relative to low-precision exponent ranges or on accumulation error in the ||x||² + ||y||² – 2<x,y> formula under varying k or dimension; while the paper is empirical, the absence of such guidance leaves open the possibility of divergence outside the tested regimes.

    Authors: The work is deliberately empirical. We will expand the discussion section to supply practical, experiment-derived guidance on observed safe ranges of data norms relative to the tested exponent widths and to comment on accumulation behavior across the range of k and dimensions examined. We will also state explicitly that general analytic bounds are outside the paper’s scope and note the consequent limitation on extrapolation beyond the tested regimes. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical validation against external datasets

full rationale

The paper's central claims rest on proposing a mixed-precision framework for Lloyd's k-means and confirming stability of the standard distance formula via direct numerical simulations on chosen clustering and segmentation datasets. No equations, fitted parameters, or predictions are defined in terms of the reported outcomes; the verification steps use external data and do not reduce to self-referential inputs or self-citation chains. The work is self-contained as empirical evidence without load-bearing derivations that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard floating-point arithmetic properties and the assumption that the chosen simulation workloads exercise the relevant numerical regimes; no new entities or fitted parameters are introduced in the abstract.

axioms (2)
  • standard math Standard properties of floating-point arithmetic (IEEE 754 rounding, overflow behavior) govern the distance computations.
    Invoked when discussing stability of the distance formula and overflow risk.
  • domain assumption The simulation data sets and tasks are representative of typical k-means workloads.
    Required to extrapolate from the reported experiments to general applicability.

pith-pipeline@v0.9.0 · 5773 in / 1310 out tokens · 17720 ms · 2026-05-23T22:31:09.317633+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 2 internal anchors

  1. [1]

    Gersho, R

    A. Gersho, R. M. Gray, Vector Quantization and Signal Compression, Springer-Verlag, New York, USA, 1991. doi:10.1007/ 978-1-4615-3626-0 . 18 (a) mp k-means++ low (b) mp k-means++ mp Figure 7.4: Results of image segmentation I (fp16)

  2. [2]

    J ´egou, M

    H. J ´egou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1) (2010) 117–128. doi:10.1109/TPAMI.2010.57

  3. [3]

    Norouzi, D

    M. Norouzi, D. J. Fleet, Cartesian K-means, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2013, pp. 3017–3024. doi:10.1109/CVPR.2013.388

  4. [4]

    Jorda, A

    J. Jorda, A. V . Kajava, T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm, Bioinformatics 25 (20) (2009) 2632–2638. doi:10.1093/bioinformatics/btp482

  5. [5]

    P. A. Marin Zapata, S. Roth, D. Schmutzler, T. Wolf, E. Manesso, D. Clevert, Self-supervised feature extraction from image time series in plant phenotyping using triplet networks, Bioinformatics 37 (6) (2020) 861–867. doi:10.1093/bioinformatics/btaa905

  6. [6]

    J. H. Young, M. Peyton, H.-S. Kim, E. McMillan, J. D. Minna, M. A. White, E. M. Marcotte, Computational discovery of pathway-level genetic vulnerabilities in non-small-cell lung cancer, Bioinformatics 32 (9) (2016) 1373–1379. doi:10.1093/bioinformatics/btw010

  7. [7]

    Z. Chen, Q. Sun, Extracting class activation maps from non-discriminative features as well, in: 2023 IEEE /CVF Conference on Computer Vision and Pattern Recognition, IEEE, 2023, pp. 3135–3144. doi:10.1109/CVPR52729.2023.00306

  8. [8]

    Puzicha, T

    J. Puzicha, T. Hofmann, J. M. Buhmann, Histogram clustering for unsupervised image segmentation, in: Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 1999, pp. 602–608. doi:10.1109/CVPR.1999.784981

  9. [9]

    Zhong, Z

    C. Zhong, Z. Sun, T. Tan, Robust 3D face recognition using learned visual codebook, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007. doi:10.1109/CVPR.2007.383279

  10. [10]

    S. R. Gaddam, V . V . Phoha, K. S. Balagani, K-means+ID3: A novel method for supervised anomaly detection by cascading k-means clustering and ID3 decision tree learning methods, IEEE Transactions on Knowledge and Data Engineering 19 (3) (2007) 345–354. doi:10.1109/ TKDE.2007.44

  11. [11]

    Gupta, R

    S. Gupta, R. Kumar, K. Lu, B. Moseley, S. Vassilvitskii, Local search methods for k-means with outliers, Proc. VLDB Endow. 10 (7) (2017) 757–768. doi:10.14778/3067421.3067425

  12. [12]

    Ordonez, E

    C. Ordonez, E. Omiecinski, E fficient disk-based k-means clustering for relational databases, IEEE Transactions on Knowledge and Data Engineering 16 (8) (2004) 909–921. doi:10.1109/TKDE.2004.25

  13. [13]

    S. Wang, Y . Sun, Z. Bao, On the e fficiency of K-means clustering: Evaluation, optimization, and algorithm selection, Proc. VLDB Endow. 14 (2) (2020) 163–175. doi:10.14778/3425879.3425887

  14. [14]

    S. Guo, N. Yao, Document vector extension for documents classification, IEEE Transactions on Knowledge and Data Engineering 33 (8) (2021) 3062–3074. doi:10.1109/TKDE.2019.2961343

  15. [15]

    Arthur, S

    D. Arthur, S. Vassilvitskii, How slow is the k-means method?, in: SCG’06: Proceedings of the 22nd annual symposium on Computational geometry, ACM Press, New York, USA, 2006, pp. 144–153. doi:10.1145/1137856.1137880

  16. [16]

    Arthur, S

    D. Arthur, S. Vassilvitskii, k-means++: the advantages of careful seeding, in: H. Gabow (Ed.), SODA’07: Proceedings of the 8th annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035. URL https://dl.acm.org/doi/10.5555/1283383.1283494

  17. [17]

    Bahmani, B

    B. Bahmani, B. Moseley, A. Vattani, R. Kumar, S. Vassilvitskii, Scalable k-means ++, Proc. VLDB Endow. 5 (7) (2012) 622–633. doi: 10.14778/2180912.2180915

  18. [18]

    Newling, F

    J. Newling, F. Fleuret, Nested mini-batch K-means, in: D. D. Lee, U. von Luxburg, R. Garnett, I. G. Masashi Sugiyama (Eds.), NIPS’16: 19 (a) mp k-means++ low (b) mp k-means++ mp Figure 7.5: Results of image segmentation I (q52). Proceedings of the 30th International Conference on Neural Information Processing Systems, Curran Associates Inc., New York, USA...

  19. [19]

    Sculley, Web-scale k-means clustering, in: WWW’10: Proceedings of the 19th international conference on World wide web, ACM Press, New York, USA, 2010, pp

    D. Sculley, Web-scale k-means clustering, in: WWW’10: Proceedings of the 19th international conference on World wide web, ACM Press, New York, USA, 2010, pp. 1177–1178. doi:10.1145/1772690.1772862

  20. [20]

    Ismkhan, I-k-means- +: An iterative clustering algorithm based on an enhanced version of the k-means, Pattern Recognition 79 (2018) 402–413

    H. Ismkhan, I-k-means- +: An iterative clustering algorithm based on an enhanced version of the k-means, Pattern Recognition 79 (2018) 402–413. doi:10.1016/j.patcog.2018.02.015

  21. [22]

    Fr ¨anti, S

    P. Fr ¨anti, S. Sieranoja, How much can k-means be improved by using better initialization and repeats?, Pattern Recognition 93 (C) (2019) 95–112. doi:10.1016/j.patcog.2019.04.014

  22. [23]

    Y . Xu, W. Qu, Z. Li, G. Min, K. Li, Z. Liu, E fficient k-means++ approximation with MapReduce, IEEE Transactions on Parallel and Distributed Systems 25 (12) (2014) 3135–3144. doi:10.1109/TPDS.2014.2306193

  23. [24]

    W. Zhao, H. Ma, Q. He, Parallel K-means clustering based on mapreduce, in: CloudCom’09: Proceedings of the 1st International Conference on Cloud Computing, Springer, 2009, pp. 674–679. doi:10.1007/978-3-642-10665-1_71

  24. [25]

    doi:10.1109/IEEESTD

    IEEE Standard for Floating-Point Arithmetic,IEEE Std 754-2019 (Revision of IEEE Std 754-2008), IEEE, 2019. doi:10.1109/IEEESTD. 2019.8766229

  25. [26]

    Carson, N

    E. Carson, N. J. Higham, Accelerating the solution of linear systems by iterative refinement in three precisions, SIAM J. Sci. Comput. 40 (2) (2018) A817–A847. doi:10.1137/17M1140819

  26. [27]

    Haidar, H

    A. Haidar, H. Bayraktar, S. Tomov, J. Dongarra, N. J. Higham, Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems, Proc. Roy. Soc. London A 476 (2243) (2020) 20200110. doi:10.1098/rspa.2020.0110

  27. [28]

    Carson, N

    E. Carson, N. J. Higham, S. Pranesh, Three-precision GMRES-based iterative refinement for least squares problems, SIAM J. Sci. Comput. 42 (6) (2020) A4063–A4083. doi:10.1137/20m1316822

  28. [29]

    N. J. Higham, S. Pranesh, Exploiting lower precision arithmetic in solving symmetric positive definite linear systems and least squares problems, SIAM J. Sci. Comput. 43 (1) (2021) A258–A277. doi:10.1137/19M1298263

  29. [30]

    Training deep neural networks with low precision multiplications

    M. Courbariaux, Y . Bengio, J.-P. David, Training deep neural networks with low precision multiplications, ArXiv:1412.7024v5 (2015). URL https://arxiv.org/abs/1412.7024v5

  30. [31]

    Micikevicius, S

    P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, H. Wu, Mixed precision training, in: ICLR 2018: 6th International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1gs9JgRZ

  31. [32]

    N. J. Higham, S. Pranesh, M. Zounon, Squeezing a matrix into half precision, with an application to solving linear systems, SIAM Journal on Scientific Computing 41 (4) (2019) A2536–A2551

  32. [33]

    Abdelfattah, H

    A. Abdelfattah, H. Anzt, E. G. Boman, E. Carson, T. Cojean, J. Dongarra, A. Fox, M. Gates, N. J. Higham, X. S. Li, J. Loe, P. Luszczek, S. Pranesh, S. Rajamanickam, T. Ribizel, B. F. Smith, K. Swirydowicz, S. Thomas, S. Tomov, Y . M. Tsai, U. M. Yang, A survey of numerical linear algebra methods utilizing mixed-precision arithmetic, Int. J. High Perform. ...

  33. [34]

    N. J. Higham, T. Mary, Mixed precision algorithms in numerical linear algebra, Acta Numerica 31 (2022) 347–414. doi:10.1017/ s0962492922000022

  34. [35]

    B. Rokh, A. Azarpeyvand, A. Khanteymoori, A comprehensive survey on model quantization for deep neural networks in image classification, ACM Trans. Intell. Syst. Technol. 14 (6) (2023) 1–50. doi:10.1145/3623402

  35. [36]

    N. Wang, J. Choi, D. Brand, C.-Y . Chen, K. Gopalakrishnan, Training deep neural networks with 8-bit floating point numbers, in: S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi (Eds.), NIPS’18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, Curran Associates Inc., New York, USA, 2018, pp. 76...

  36. [37]

    M. F. Balcan, S. Ehrlich, Y . Liang, Distributedk-means and k-median clustering on general topologies, in: C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Q. Weinberger (Eds.), NIPS’2013: Proceedings of the 26th International Conference on Neural Information Processing Systems - V olume 2, Curran Associates Inc., New York, USA, 2013, pp. 1995–2003

  37. [38]

    P. K. Agarwal, S. Har-Peled, K. R. Varadarajan, Geometric approximation via coresets, Combinatorial and Computational Geometry 52 (2005) 1–30

  38. [39]

    Har-Peled, S

    S. Har-Peled, S. Mazumdar, On coresets for k-means and k-median clustering, in: L. Babai (Ed.), STOC04: Symposium of Theory of Computing 2004, ACM Press, New York, 2004, pp. 291–300. doi:10.1145/1007352.1007400

  39. [40]

    J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (1) (2008) 107–113. doi:10.1145/1327452.1327492

  40. [41]

    Bai, L.-l

    H.-t. Bai, L.-l. He, D.-t. Ouyang, Z.-s. Li, H. Li, K-means on commodity GPUs with CUDA, in: CSIE’09: Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering, IEEE, Washington, DC, USA, 2009, pp. 654–655. doi:10.1109/ CSIE.2009.491

  41. [42]

    C. Lutz, S. Breß, T. Rabl, S. Zeuch, V . Markl, E fficient k-means on GPUs, in: W. Lehner, K. Salem (Eds.), DAMON’18: Proceedings of the 14th International Workshop on Data Management on New Hardware, ACM Press, New York, USA, 2018, pp. 3:1–3:3. doi: 10.1145/3211922.3211925

  42. [43]

    M. Li, E. Frank, B. Pfahringer, Large scale K-means clustering using GPUs, Data Mining and Knowledge Discovery 37 (1) (2023) 67–109. doi:10.1007/s10618-022-00869-6

  43. [44]

    Hastie, R

    T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition, Springer- Verlag, New York, USA, 2009. doi:10.1007/978-0-387-84858-7

  44. [45]

    Io ffe, C

    S. Io ffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: F. Bach, D. Blei (Eds.), ICML’15: Proceedings of the 32nd International Conference on International Conference on Machine Learning, V ol. 37, JMLR.org, 2015, pp. 448–456. URL https://dl.acm.org/doi/10.5555/3045118.3045167

  45. [46]

    J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, ArXiv:1607.06450 (2016). URL https://arxiv.org/abs/1607.06450

  46. [47]

    Xiong, Y

    R. Xiong, Y . Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y . Lan, L. Wang, T. Liu, On layer normalization in the transformer architecture, in: H. Daum ´e, A. Singh (Eds.), ICML’20: Proceedings of the 37th International Conference on Machine Learning, JMLR.org, 2020, pp. 10524–10533. 21 (a) mp k-means++ low (b) mp k-means++ mp Figure 7.7: Results ...

  47. [48]

    J. Xu, X. Sun, Z. Zhang, G. Zhao, J. Lin, Understanding and improving layer normalization, in: H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alch´e-Buc, E. B. Fox (Eds.), NIPS’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Curran Associates Inc., 2019, p. 4381–4391. URL https://dl.acm.org/doi/10.5555/34...

  48. [49]

    Lloyd, Least squares quantization in PCM, IEEE Trans

    S. Lloyd, Least squares quantization in PCM, IEEE Trans. Inform. Theory 28 (2) (1982) 129–137. doi:10.1109/TIT.1982.1056489

  49. [50]

    Bl ¨omer, C

    J. Bl ¨omer, C. Lammersen, M. Schmidt, C. Sohler, Theoretical analysis of the k-means algorithm–a survey, in: L. Kliemann, P. Sanders (Eds.), Algorithm Engineering, V ol. 9220 of Lecture Notes in Computer Science, Springer-Verlag, Cham, Switzerland, 2016, pp. 81–116. doi:10.1007/978-3-319-49487-6_3

  50. [51]

    S. Z. Selim, M. A. Ismail, K-means-type algorithms: A generalized convergence theorem and characterization of local optimality, IEEE Transactions on pattern analysis and machine intelligence PAMI-6 (1) (1984) 81–87. doi:10.1109/TPAMI.1984.4767478

  51. [52]

    Kanungo, D

    T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, A. Y . Wu, A local search approximation algorithm for k-means clustering, in: SCG’02: Proceedings of the 18th annual symposium on Computational geometry, ACM Press, New York, USA, 2002, pp. 10–18. doi:10.1145/513400.513402

  52. [53]

    Bubeck, M

    S. Bubeck, M. Meil ˘a, U. von Luxburg, How the initialization affects the stability of the k-means algorithm, ESAIM: Probability and Statistics 16 (2012) 436–452. doi:10.1051/ps/2012013

  53. [54]

    Pedregosa, G

    F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830

  54. [55]

    L. S. Blackford, J. Demmel, J. Dongarra, I. Du ff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, R. C. Whaley, An updated set of basic linear algebra subprograms (BLAS), ACM Trans. Math. Softw. (TOMS) 28 (2) (2002) 135–151. doi:10.1145/567806.567807

  55. [56]

    X. Chen, S. G ¨uttel, Fast and explainable clustering based on sorting, Pattern Recognition 150 (2024) 110298. doi:10.1016/j.patcog. 2024.110298

  56. [57]

    N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd Edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002. doi:10.1137/1.9780898718027

  57. [58]

    G. H. Golub, C. F. Van Loan, Matrix Computations, 4th Edition, Johns Hopkins University Press, Baltimore, MD, USA, 2013

  58. [59]

    N. J. Higham, Functions of Matrices: Theory and Computation, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA,

  59. [60]

    doi:10.1137/1.9780898717778

  60. [61]

    M. Fasi, N. J. Higham, F. Lopez, T. Mary, M. Mikaitis, Matrix multiplication in multiword arithmetic: Error analysis and application to GPU tensor cores, SIAM J. Sci. Comput. 45 (1) (2023) C1–C19. doi:10.1137/21m1465032

  61. [62]

    Liu, Mixed-precision Paterson–Stockmeyer method for evaluating polynomials of matrices, ArXiv:2312.17396v2 (2023)

    X. Liu, Mixed-precision Paterson–Stockmeyer method for evaluating polynomials of matrices, ArXiv:2312.17396v2 (2023). URL https://arxiv.org/abs/2312.17396v2

  62. [63]

    N. X. Vinh, J. Epps, J. Bailey, Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11 (95) (2010) 2837–2854. 22

  63. [64]

    N. J. Higham, S. Pranesh, Simulating low precision floating-point arithmetic, SIAM J. Sci. Comput. 41 (5) (2019) C585–C602. doi: 10.1137/19M1251308

  64. [65]

    Rosenberg, J

    A. Rosenberg, J. Hirschberg, V-measure: A conditional entropy-based external cluster evaluation measure, in: J. Eisner (Ed.), Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Asso- ciation for Computational Linguistics, 2007, p. 410–420. URL https://aclanthology.org/D07-1043

  65. [66]

    Mikhailov

    P. Fr ¨anti, O. Virmajoki, Iterative shrinking method for clustering problems, Pattern Recognition 39 (5) (2006) 761–775. doi:10.1016/j. patcog.2005.09.012

  66. [67]

    D. Dua, C. Gra ff, UCI machine learning repository, http://archive.ics.uci.edu/ml (2017)

  67. [68]

    H. A. G ¨uvenir, G. Demir ¨oz, N. Ilter, Learning di fferential diagnosis of erythemato-squamous diseases using voting feature intervals, Artif. Intell. Med. 13 (3) (1998) 147–165. doi:10.1016/s0933-3657(98)00028-1

  68. [69]

    Nakai, M

    K. Nakai, M. Kanehisa, Expert system for predicting protein localization sites in gram-negative bacteria, Proteins 11 (2) (1991) 95–110. doi:10.1002/prot.340110203

  69. [70]

    Nakai, M

    K. Nakai, M. Kanehisa, A knowledge base for predicting protein localization sites in eukaryotic cells, Genomics 14 (4) (1992) 897–911. doi:10.1016/s0888-7543(05)80111-9

  70. [71]

    Anderson, The species problem in Iris, Annals of the Missouri Botanical Garden 23 (3) (1936) 457–509

    E. Anderson, The species problem in Iris, Annals of the Missouri Botanical Garden 23 (3) (1936) 457–509. doi:10.2307/2394164

  71. [72]

    R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (2) (1936) 179–188. doi:10.1111/j. 1469-1809.1936.tb02137.x

  72. [73]

    Forina, R

    M. Forina, R. Leardi, C. Armanino, S. Lanteri, PARVUS: An Extendable Package of Programs for Data Exploration, Classification and Correlation, Elsevier, Amsterdam, The Netherlands, 1988

  73. [74]

    J. Deng, W. Dong, R. Socher, L.-J. Li, L. Kai, F.-F. Li, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848

  74. [75]

    J. Jang, H. Jiang, DBSCAN ++: Towards fast and scalable density clustering, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, V ol. 97, PMLR, 2019, pp. 3019–3029. URL https://proceedings.mlr.press/v97/jang19a.html

  75. [76]

    Ester, H.-P

    M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: E. Simoudis, J. H. U. Fayyad (Eds.), KDD’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 226–231. URL https://dl.acm.org/doi/10.5555/3001460.3001507

  76. [77]

    J. B. Tenenbaum, V . de Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323. doi:10.1126/science.290.5500.2319

  77. [78]

    X. Chen, S. G ¨uttel, Fast and exact fixed-radius neighbor search based on sorting, PeerJ Computer Science (2024) 10:e1929 doi:10.7717/ peerj-cs.1929. 23