Recognition: no theorem link
Assessing the impact of dimensionality reduction on clustering performance -- a systematic study
Pith reviewed 2026-05-13 06:10 UTC · model grok-4.3
The pith
Dimensionality reduction changes clustering results in ways that require matching the technique and reduction level to the data's geometry and the algorithm.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that the performance impact of dimensionality reduction on clustering is not uniform: PCA, Kernel PCA, VAE, Isomap, and MDS each interact differently with k-means, agglomerative hierarchical clustering, Gaussian mixture models, and OPTICS depending on the target dimensionality and the intrinsic geometry of the input data. Comparisons without reduction versus reduction at the literature-suggested levels show that quality measured by adjusted Rand index can rise or fall, confirming that practitioners must choose both the reduction technique and the number of retained dimensions with reference to the specific data and clustering method.
What carries the argument
The controlled experimental loop that applies each of the five dimensionality reduction techniques at three literature-recommended target dimensionalities before running each of the four clustering algorithms and scoring the output with adjusted Rand index.
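To make this machinery concrete, the following is a minimal sketch of such a loop in scikit-learn on a stand-in dataset; the paper's actual datasets and hyperparameters are not given in this review, the VAE branch is omitted because it needs a separate deep-learning model, and every parameter choice below is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import Isomap, MDS
from sklearn.cluster import KMeans, AgglomerativeClustering, OPTICS
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Stand-in high-dimensional dataset with known labels (not one of the paper's datasets).
X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
k, d = len(np.unique(y)), X.shape[1]

# Literature-recommended reduction levels: k-1, 25%, and 50% of the original dimensions.
levels = {"k-1": k - 1, "25%": max(2, d // 4), "50%": max(2, d // 2)}

def cluster_all(Z, k):
    """Run the four clustering algorithms on the (reduced) data and return their labelings."""
    return {
        "k-means": KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z),
        "AHC": AgglomerativeClustering(n_clusters=k).fit_predict(Z),
        "GMM": GaussianMixture(n_components=k, random_state=0).fit_predict(Z),
        "OPTICS": OPTICS(min_samples=10).fit_predict(Z),
    }

results = {}
# Baseline without dimensionality reduction.
for algo, pred in cluster_all(X, k).items():
    results[("none", "full", algo)] = adjusted_rand_score(y, pred)

for level_name, n_comp in levels.items():
    reducers = {
        "PCA": PCA(n_components=n_comp),
        "Kernel PCA": KernelPCA(n_components=n_comp, kernel="rbf"),
        "Isomap": Isomap(n_components=n_comp),
        "MDS": MDS(n_components=n_comp, random_state=0),
    }
    for dr_name, reducer in reducers.items():
        Z = reducer.fit_transform(X)
        for algo, pred in cluster_all(Z, k).items():
            results[(dr_name, level_name, algo)] = adjusted_rand_score(y, pred)
```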
If this is right
- For some algorithm-data pairs, reduction to exactly k-1 dimensions yields the largest gain in adjusted Rand index.
- Kernel-based and manifold methods such as Kernel PCA and Isomap outperform linear PCA on data with nonlinear structure.
- Omitting dimensionality reduction entirely is sometimes preferable to applying a mismatched technique.
- The same reduction level can improve one clustering algorithm while degrading another on identical input data.
Where Pith is reading between the lines
- Practitioners should run a small grid search over reduction methods and levels on a held-out subset before committing to a preprocessing pipeline (see the sketch after this list).
- The results imply that clustering benchmarks that fix a single reduction step may underestimate the best attainable performance for a given algorithm.
- Future work could test whether an automated selector that inspects data geometry metrics can predict the best reduction choice without exhaustive trials.
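A hedged sketch of the grid-search recommendation from the first bullet above, assuming a small labeled subset is available for pilot evaluation; the function name, subset size, and the use of k-means as the pilot algorithm are illustrative assumptions, and an internal index such as the silhouette score could stand in for ARI when ground-truth labels are unavailable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import Isomap
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def select_reduction(X, y, k, subset_size=500, random_state=0):
    """Score (reducer, level) candidates by ARI on a held-out subset and return the best."""
    if len(X) > subset_size:
        X_sub, _, y_sub, _ = train_test_split(
            X, y, train_size=subset_size, stratify=y, random_state=random_state)
    else:
        X_sub, y_sub = X, y
    d = X.shape[1]
    candidates = []
    for level in (k - 1, max(2, d // 4), max(2, d // 2)):
        for name, reducer in (("PCA", PCA(n_components=level)),
                              ("Kernel PCA", KernelPCA(n_components=level, kernel="rbf")),
                              ("Isomap", Isomap(n_components=level))):
            Z = reducer.fit_transform(X_sub)
            pred = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(Z)
            candidates.append((adjusted_rand_score(y_sub, pred), name, level))
    return max(candidates)  # (best ARI, reducer name, number of dimensions)
```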
Load-bearing premise
The tested datasets and the three specific reduction levels drawn from prior literature are representative enough to support general advice on tailoring choices to data geometry.
What would settle it
A new collection of high-dimensional datasets on which one fixed reduction choice, such as PCA to 50 percent of the original dimensions, produced a higher adjusted Rand index than any tailored selection for all four clustering algorithms would falsify the claim that tailoring is required.
Original abstract
Dimensionality reduction is a critical preprocessing step for clustering high-dimensional data, yet comprehensive evaluation of its impact across diverse methods and data types remains limited. In this study, we systematically assess the influence of five dimensionality reduction techniques - Principal Component Analysis (PCA), Kernel Principal Component Analysis (Kernel PCA), Variational Autoencoder (VAE), Isometric Mapping (Isomap), and Multidimensional Scaling (MDS) - on the performance of four popular clustering algorithms - k-means, Agglomerative Hierarchical Clustering (AHC), Gaussian Mixture Models (GMM), and Ordering Points to Identify the Clustering Structure (OPTICS). We evaluate clustering quality using the Adjusted Rand Index (ARI), comparing results without and with dimensionality reduction at different reduction levels recommended in the literature (i.e., k-1, where k is the number of clusters, and 25% and 50% of the original number of dimensions). Our findings underscore the importance of a careful selection of the dimensionality reduction technique and the dimensionality reduction level that should be tailored to intrinsic data geometry and clustering algorithms under consideration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a systematic empirical assessment of five dimensionality reduction techniques (PCA, Kernel PCA, VAE, Isomap, MDS) applied to four clustering algorithms (k-means, AHC, GMM, OPTICS), evaluating their impact on clustering quality via the Adjusted Rand Index (ARI) at reduction levels of k-1, 25%, and 50% of the original dimensions, as recommended in the literature. The authors conclude that the selection of DR technique and reduction level should be tailored to the intrinsic geometry of the data and the specific clustering algorithm.
Significance. If substantiated by robust experiments across representative datasets, this work could provide practical insights into preprocessing high-dimensional data for clustering, emphasizing that generic DR choices may not be optimal. The multi-method comparison using a standard external metric such as ARI is a strength of the study.
major comments (3)
- [Abstract] The abstract provides no information on the datasets used, the number of experimental runs, or any statistical testing for significance. This is load-bearing for the central claim, as the recommendation to tailor DR to 'intrinsic data geometry' cannot be evaluated without evidence that the tested data cover a sufficient range of geometries, separability, and noise characteristics, and that differences are statistically reliable.
- [Abstract and Methods] Experimental design: Reduction levels are limited to the fixed literature-recommended values (k-1, 25%, 50%). The claim that levels 'should be tailored to intrinsic data geometry' is not directly supported if the experiments do not explore other levels or demonstrate that these fixed choices are suboptimal for particular geometries or algorithms.
- [Results and Discussion] The manuscript needs to explicitly link performance variations to measurable data characteristics (e.g., manifold dimensionality, cluster separability) to justify the 'tailored to intrinsic data geometry' conclusion; without this, the general guidance remains under-supported by the reported comparisons.
minor comments (2)
- [Throughout] Ensure consistent definition of all acronyms (e.g., ARI, VAE) on first use and clarify any notation for reduction levels in tables or figures.
- [Methods] Consider adding a table summarizing dataset characteristics (dimensionality, number of clusters, sample size) to improve reproducibility and context for the geometry claims.
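A minimal sketch of the dataset-characteristics table requested above, assuming the benchmark datasets can be gathered as (name, X, y) triples; the helper name and column choices are hypothetical, not taken from the manuscript.

```python
import numpy as np
import pandas as pd

def summarize_datasets(datasets):
    """Build a one-row-per-dataset summary of sample size, dimensionality, and cluster count."""
    rows = [{"dataset": name,
             "samples": X.shape[0],
             "dimensions": X.shape[1],
             "clusters": len(np.unique(y))}
            for name, X, y in datasets]
    return pd.DataFrame(rows)
```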
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve clarity and evidential support for our conclusions.
Point-by-point responses
Referee: [Abstract] The abstract provides no information on the datasets used, the number of experimental runs, or any statistical testing for significance. This is load-bearing for the central claim, as the recommendation to tailor DR to 'intrinsic data geometry' cannot be evaluated without evidence that the tested data cover a sufficient range of geometries, separability, and noise characteristics, and that differences are statistically reliable.
Authors: We agree that the abstract should summarize key aspects of the experimental design to better support the claims. The revised abstract will include information on the datasets used, the number of experimental runs, and any statistical testing performed on the ARI results. revision: yes
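A hedged sketch of one way the promised statistical testing could be carried out: a paired Wilcoxon signed-rank test over per-run ARI scores for the same algorithm with and without reduction. The ARI values below are hypothetical placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-run ARI scores for one clustering algorithm on one dataset.
ari_no_reduction = np.array([0.41, 0.39, 0.44, 0.40, 0.42])
ari_with_pca_50 = np.array([0.47, 0.45, 0.46, 0.49, 0.44])

# Paired test on the run-wise differences; a small p-value suggests the ARI change is reliable.
stat, p_value = wilcoxon(ari_with_pca_50, ari_no_reduction)
print(f"Wilcoxon statistic = {stat:.3f}, p = {p_value:.4f}")
```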
Referee: [Abstract and Methods] Experimental design: Reduction levels are limited to the fixed literature-recommended values (k-1, 25%, 50%). The claim that levels 'should be tailored to intrinsic data geometry' is not directly supported if the experiments do not explore other levels or demonstrate that these fixed choices are suboptimal for particular geometries or algorithms.
Authors: The experiments were restricted to the reduction levels commonly recommended in the literature to focus on practical guidance. While this design does not directly test other levels or prove suboptimality in all cases, the observed performance differences across techniques at these levels already indicate that no universal choice is optimal. We will revise the Discussion to qualify the tailoring claim more precisely and note the limitation regarding unexplored levels. revision: partial
Referee: [Results and Discussion] The manuscript needs to explicitly link performance variations to measurable data characteristics (e.g., manifold dimensionality, cluster separability) to justify the 'tailored to intrinsic data geometry' conclusion; without this, the general guidance remains under-supported by the reported comparisons.
Authors: We will revise the Results and Discussion sections to add explicit analysis connecting performance variations to measurable data characteristics, including estimates of intrinsic dimensionality and cluster separability metrics, with discussion of how these relate to the effectiveness of different DR techniques and algorithms. revision: yes
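A minimal sketch of two data-geometry descriptors of the kind the revised analysis could report: a PCA-based estimate of intrinsic dimensionality and a label-based separability measure. The variance threshold and the choice of silhouette score are illustrative assumptions, not the authors' protocol.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def intrinsic_dim_pca(X, variance_threshold=0.95):
    """Smallest number of principal components whose cumulative explained variance reaches the threshold."""
    ratios = PCA().fit(X).explained_variance_ratio_
    return int(np.searchsorted(np.cumsum(ratios), variance_threshold) + 1)

def label_separability(X, y):
    """Mean silhouette of the ground-truth partition; higher values indicate better-separated clusters."""
    return float(silhouette_score(X, y))
```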
Circularity Check
No circularity: purely empirical evaluation
Full rationale
The paper performs a systematic empirical comparison of five DR methods on four clustering algorithms across literature-recommended reduction levels, measuring outcomes with the external ARI metric. No derivations, fitted parameters, predictions, or self-citations appear in the load-bearing claims; results are benchmarked directly against ground-truth labels on the tested datasets. The central recommendation to tailor choices follows from observed performance differences rather than any definitional or self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The reduction levels k-1, 25%, and 50% of the original number of dimensions are appropriate benchmarks, as recommended in the literature.