Evaluation of the number of clusters in a data set using p-values from Multiple Tests of Hypotheses
Pith reviewed 2026-05-21 02:45 UTC · model grok-4.3
The pith
Combining p-values from multiple nonparametric tests on interpoint distances determines the number of clusters in a dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that interpoint distances computed from a given data set can serve as the basis for a collection of univariate nonparametric hypothesis tests; the p-values from these tests can then be combined in a stepwise decision process that identifies the true number of clusters present, providing an efficient and accurate alternative to existing cluster accuracy indices when used with any standard clustering algorithm.
What carries the argument
Stepwise combination of p-values obtained from univariate nonparametric tests performed on interpoint distances.
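As a minimal sketch of the raw material this argument rests on, the snippet below computes all Euclidean interpoint distances for data of any dimension. The function names `interpoint_distances` and `distances_from` are illustrative, and treating the distances from one observation as one univariate sample is an assumption about how "as many tests as the sample size" could arise, not the paper's stated construction.

```python
import math
from typing import List, Sequence


def interpoint_distances(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the symmetric n x n matrix of Euclidean distances between points."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # works for any dimension
            dist[i][j] = dist[j][i] = d
    return dist


def distances_from(dist: List[List[float]], i: int) -> List[float]:
    """Distances from observation i to every other observation (assumed test input)."""
    return [d for j, d in enumerate(dist[i]) if j != i]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = interpoint_distances(points)
row_for_point_1 = distances_from(D, 1)
```

Working on distances rather than coordinates is what lets every downstream test stay univariate, whatever the dimension of the data.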
If this is right
- The index can be paired with any clustering algorithm that accepts a pre-specified number of clusters as input.
- It applies directly to data sets of arbitrary dimension without requiring dimension reduction.
- It reduces the number of unnecessary computations relative to many existing cluster validity indices.
- It supplies a statistical decision rule grounded in hypothesis testing rather than purely heuristic criteria.
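The consequences listed above can be pictured with a hypothetical skeleton of a stepwise search. Everything here is an assumption made for illustration: the name `select_k_stepwise`, the callback `combined_p_value` (standing in for "cluster with k groups, run the tests, combine the p-values"), the direction of the hypothesis, and the rule of stopping at the first non-rejection. The paper's actual decision rule is not given in the abstract.

```python
from typing import Callable, Optional


def select_k_stepwise(
    combined_p_value: Callable[[int], float],
    alpha: float = 0.05,
    k_max: int = 10,
) -> Optional[int]:
    """Return the first k whose combined p-value does not reject 'k clusters suffice'."""
    for k in range(1, k_max + 1):
        p = combined_p_value(k)  # hypothetical: cluster into k groups, test, combine
        if p > alpha:            # no evidence of further structure: stop here
            return k
    return None                  # every candidate up to k_max was rejected


# Toy stand-in for the real pipeline: pretend these combined p-values were observed.
toy_p = {1: 0.001, 2: 0.004, 3: 0.40, 4: 0.75}
chosen = select_k_stepwise(lambda k: toy_p.get(k, 1.0), alpha=0.05, k_max=4)
```

Under this assumed rule the search never evaluates k = 4, which is one way a stepwise procedure could avoid the computations that an index evaluated at every candidate k would require.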
Where Pith is reading between the lines
- The dependence among interpoint distances may require a specific multiple-testing adjustment that the paper leaves implicit; explicit simulation checks across increasing dimensions would clarify the robustness.
- The same distance-based testing framework could be examined for streaming or online settings where new points arrive sequentially.
- Links to established multiple-testing procedures such as false-discovery-rate control might increase power while preserving the stepwise structure.
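To make the false-discovery-rate suggestion concrete, here is a small sketch of the standard Benjamini-Hochberg step-up rule. It is not part of the paper; the function name is illustrative. Benjamini-Hochberg is known to control the FDR under independence and under certain forms of positive dependence, and whether the distance-based p-values meet those conditions is exactly the open question raised above.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return one reject flag per hypothesis under the BH step-up rule at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank i with p_(i) <= i * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis whose rank is at or below the cutoff.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject


flags = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60], q=0.05)
```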
Load-bearing premise
The interpoint distances under the null hypothesis of no clustering structure yield p-values that can be validly combined in a stepwise manner without distortion from their mutual dependence or from the multiplicity of tests.
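The dependence this premise worries about is easy to see in a toy simulation. The sketch below, which is an illustration and not taken from the paper, estimates the correlation between two distances that share an endpoint when there is no cluster structure at all (i.i.d. standard normal points in one dimension). The function names and the choice of distribution are assumptions.

```python
import math
import random
import statistics
from typing import List


def pearson(a: List[float], b: List[float]) -> float:
    """Plain Pearson correlation coefficient."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def shared_point_correlation(n_trials: int = 20000, seed: int = 0) -> float:
    """Estimate corr(|X1 - X2|, |X1 - X3|) for i.i.d. standard normal X1, X2, X3."""
    rng = random.Random(seed)
    d12, d13 = [], []
    for _ in range(n_trials):
        x1, x2, x3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        d12.append(abs(x1 - x2))  # both distances share the point x1
        d13.append(abs(x1 - x3))
    return pearson(d12, d13)


r = shared_point_correlation()
```

In this Gaussian setting the two distances are positively correlated (roughly 0.2), so statistics built from distances that share observations cannot be treated as independent even under the null.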
What would settle it
Apply the procedure to synthetic data generated from a known mixture of well-separated Gaussian components and observe whether the stepwise p-value process correctly stops at the true number of components, or fails to recover that number when the components are allowed to overlap heavily.
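A minimal generator for the experiment described above might look like the following. The centers, sample sizes, and noise levels are arbitrary choices for illustration, and `gaussian_mixture` is a hypothetical helper, not code from the paper.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def gaussian_mixture(
    centers: List[Point],
    n_per_component: int,
    sigma: float,
    seed: int = 0,
) -> Tuple[List[Point], List[int]]:
    """Draw n_per_component points around each center with spherical noise sigma."""
    rng = random.Random(seed)
    points: List[Point] = []
    labels: List[int] = []
    for label, center in enumerate(centers):
        for _ in range(n_per_component):
            points.append(tuple(rng.gauss(c, sigma) for c in center))
            labels.append(label)  # ground-truth component, used only for scoring
    return points, labels


centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # true number of components: 3
separated, y_sep = gaussian_mixture(centers, 50, sigma=1.0)    # well separated
overlapping, y_ovl = gaussian_mixture(centers, 50, sigma=6.0)  # heavy overlap
```

Running any candidate procedure on `separated` and `overlapping` and recording how often it returns 3 gives the kind of evidence the paragraph above asks for.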
Original abstract
This paper proposes a novel, nonparametric, interpoint distance-based measure to investigate whether there exist any groups in a set of given data, and if so then, how many groups are prevailing in total. It is a cluster accuracy index useful for arbitrary-dimensional data set, in association with any clustering algorithm having the number of groups specified as a priori. We perform univariate, nonparametric, multiple statistical tests of hypotheses, where as many dependent tests as the sample size are carried out using the interpoint distances. They possess $p$-values to be combined to reach a decision, which is taken in a step-wise process for a possible number of clusters. It reduces the unnecessary computations compared with the other accuracy measures from the literature. Data study establishes the proposed index's efficiency and superiority.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a nonparametric interpoint distance-based index for determining the number of clusters in a dataset. It performs univariate nonparametric hypothesis tests on the interpoint distances (as many dependent tests as the sample size, i.e. n tests for n points), obtains p-values, and combines them via a stepwise rule to select the number of clusters k when used in conjunction with any clustering algorithm that takes k as input. The abstract claims the procedure is computationally lighter than existing accuracy indices and superior in data studies.
Significance. If the dependence among the distance-based test statistics can be shown not to invalidate the p-value combination and if the stepwise rule can be proven to recover the true k with controlled error rates, the method would supply a lightweight, distribution-free alternative for cluster-number selection that avoids the need to compute full clustering validity indices for each candidate k. The claimed reduction in unnecessary computations is a practical advantage worth verifying.
major comments (2)
- The abstract states that 'as many dependent tests as the sample size are carried out using the interpoint distances' and that p-values are 'combined to reach a decision' in a 'step-wise process.' No explicit test statistic, null distribution, or combination rule (Fisher, Simes, Bonferroni, etc.) is supplied, nor is any argument given that the strong dependence induced by shared observations does not invalidate the error-rate guarantees of the chosen combination method. This omission is load-bearing for the central claim that the procedure correctly identifies the true number of clusters.
- The data-study claim of 'efficiency and superiority' cannot be evaluated because the manuscript provides neither the precise definition of the proposed index, the clustering algorithms and data sets used, nor any power or error-rate comparison against standard indices (e.g., silhouette, gap statistic, or Davies-Bouldin). Without these details the superiority assertion remains unsupported.
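For readers unfamiliar with the combination rules named in the first major comment, the sketch below implements the textbook forms of Bonferroni, Simes, and Fisher. It does not reproduce the paper's rule, which the abstract leaves unspecified. Bonferroni holds under arbitrary dependence, Simes is valid under independence and some forms of positive dependence, and Fisher's chi-square reference distribution assumes independent p-values.

```python
import math
from typing import Sequence


def bonferroni(p: Sequence[float]) -> float:
    """Combined p-value: n times the smallest p-value, capped at 1."""
    return min(1.0, len(p) * min(p))


def simes(p: Sequence[float]) -> float:
    """Combined p-value: minimum over ranks i of n * p_(i) / i, capped at 1."""
    n = len(p)
    return min(1.0, min(n * pi / i for i, pi in enumerate(sorted(p), start=1)))


def fisher(p: Sequence[float]) -> float:
    """Combined p-value from -2 * sum(log p), chi-square with 2n degrees of freedom.

    Assumes independent p-values, all strictly greater than zero.
    """
    n = len(p)
    half = -sum(math.log(pi) for pi in p)  # half of the Fisher statistic
    # The chi-square survival function has a closed form for even df (here 2n):
    # exp(-x/2) * sum_{j=0}^{n-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= half / j
        total += term
    return math.exp(-half) * total


p_values = [0.02, 0.021, 0.022, 0.5]
combined = {"bonferroni": bonferroni(p_values), "simes": simes(p_values), "fisher": fisher(p_values)}
```

The three rules can disagree noticeably on the same inputs, which is why the choice of rule and its behaviour under dependence matter for the central claim.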
minor comments (1)
- The abstract refers to 'univariate, nonparametric, multiple statistical tests of hypotheses' without naming the underlying nonparametric test (Wilcoxon, Kolmogorov-Smirnov, etc.) or the precise hypothesis being tested for each interpoint distance.
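Since the underlying test is not named, the following is only a stand-in showing what one univariate, distribution-free two-sample test could look like: the two-sample Kolmogorov-Smirnov statistic with a permutation p-value. The choice of statistic, the function names, and the number of permutations are all assumptions made for illustration.

```python
import random
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a_sorted) and j < len(b_sorted):
        x = min(a_sorted[i], b_sorted[j])
        while i < len(a_sorted) and a_sorted[i] <= x:
            i += 1
        while j < len(b_sorted) and b_sorted[j] <= x:
            j += 1
        d = max(d, abs(i / len(a_sorted) - j / len(b_sorted)))
    return d


def permutation_p_value(
    a: Sequence[float], b: Sequence[float], n_perm: int = 999, seed: int = 0
) -> float:
    """Permutation p-value for the KS statistic under exchangeability of the pooled sample."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[: len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one form keeps the p-value above zero


near = [float(v) for v in range(10)]
far = [float(v) for v in range(100, 110)]
p_far = permutation_p_value(near, far)
```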
Simulated Author's Rebuttal
We are grateful to the referee for the constructive feedback. Below we respond to each major comment and indicate the revisions we intend to implement in the next version of the manuscript.
- Referee: The abstract states that 'as many dependent tests as the sample size are carried out using the interpoint distances' and that p-values are 'combined to reach a decision' in a 'step-wise process.' No explicit test statistic, null distribution, or combination rule (Fisher, Simes, Bonferroni, etc.) is supplied, nor is any argument given that the strong dependence induced by shared observations does not invalidate the error-rate guarantees of the chosen combination method. This omission is load-bearing for the central claim that the procedure correctly identifies the true number of clusters.
  Authors: We thank the referee for this insightful comment. We agree that the abstract does not provide the explicit details of the test statistic, null distribution, or combination rule, and lacks an argument regarding the impact of dependence. This is a valid point, and to rectify it, we will revise the manuscript by adding a concise description in the abstract and expanding the methods section to explicitly define the test statistic (interpoint distances used in a nonparametric test like the two-sample test for equality of distributions), the null hypothesis (data from a single homogeneous cluster), the p-value calculation, and the specific stepwise combination rule employed (a modified Simes procedure). Additionally, we will include a subsection discussing the dependence structure among the test statistics and why the combination method remains valid, drawing on results from multiple testing literature for dependent tests. We believe these additions will strengthen the central claim.
  Revision: yes
- Referee: The data-study claim of 'efficiency and superiority' cannot be evaluated because the manuscript provides neither the precise definition of the proposed index, the clustering algorithms and data sets used, nor any power or error-rate comparison against standard indices (e.g., silhouette, gap statistic, or Davies-Bouldin). Without these details the superiority assertion remains unsupported.
  Authors: We concur with the referee that the claims of efficiency and superiority in the data studies cannot be fully evaluated without more details. The current manuscript provides some description but lacks the precise definitions, specific algorithms, datasets, and quantitative comparisons. In the revised version, we will include: (1) the exact mathematical definition of the proposed index, (2) a list of the clustering algorithms used (e.g., k-means, hierarchical), (3) the datasets employed (e.g., standard UCI datasets and synthetic ones with known k), and (4) direct comparisons including power, error rates (such as the proportion of times the correct k is selected), and computational times against the silhouette, gap statistic, and Davies-Bouldin indices. This will be presented in an expanded experimental section with tables and figures to support the assertions.
  Revision: yes
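As a point of reference for the promised comparisons, one of the baseline indices, the average silhouette width, can be sketched in a few lines. This follows the standard definition of the silhouette and is not code from the manuscript; the function name and the toy data are illustrative.

```python
import math
from typing import Dict, List, Sequence


def mean_silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Average silhouette width; singleton clusters contribute 0 by convention."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # s(i) = 0 for a singleton cluster
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


toy = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = mean_silhouette(toy, [0, 0, 1, 1])  # labels follow the two obvious groups
bad = mean_silhouette(toy, [0, 1, 0, 1])   # labels cut across the groups
```

A baseline of this kind has to be recomputed for every candidate k, which is the cost the proposed stepwise index claims to reduce; a timing comparison on common data would make that claim checkable.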
Circularity Check
No significant circularity; derivation relies on external statistical tests rather than self-referential construction.
full rationale
The paper defines a new interpoint-distance-based index that applies univariate nonparametric hypothesis tests to distances and combines the resulting p-values in a stepwise decision rule for selecting the number of clusters. No equations, parameter fits, or derivations are shown that reduce the proposed measure or its output to the input data or target result by construction. The abstract explicitly notes that the tests are dependent, but this is presented as part of the method description rather than a self-defining loop or a fitted prediction renamed as a result. The approach is self-contained against external benchmarks of hypothesis testing and p-value combination; no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. This is the normal case of a proposed statistical procedure whose validity rests on the properties of the tests themselves, not on circular re-use of the target quantity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Interpoint distances under the null hypothesis of no clustering structure permit valid univariate nonparametric hypothesis tests.
- domain assumption: P-values from the dependent tests can be combined in a stepwise process to reach a correct decision on the number of clusters.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We perform univariate, nonparametric, multiple statistical tests of hypotheses, where as many dependent tests as the sample size are carried out using the interpoint distances. They possess p-values to be combined to reach a decision, which is taken in a step-wise process for a possible number of clusters."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Our nonparametric, distribution-free, validity index is based on interpoint distances."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.