Scalable and Distributed Silhouette Approximation

Andrea Pietracaprina; Fabio Vandin; Federico Altieri; Geppino Pucci; Ilie Sarpe

arXiv: 2607.01993 · v1 · pith:QRFFJDM6new · submitted 2026-07-02 · 💻 cs.DS · cs.DC · cs.LG

Pith reviewed 2026-07-03 04:09 UTC · model grok-4.3

classification 💻 cs.DS cs.DC cs.LG

keywords: silhouette, clustering quality, approximation algorithms, sampling, distributed algorithms, MapReduce, metric clustering

The pith

Sampling estimates the local and global silhouette of any metric k-clustering to additive error O(ε), with probability at least 1-δ, using O(nk ε^{-2} ln(nk/δ)) distance computations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first algorithms with provable guarantees for approximating the silhouette, a standard measure of k-clustering quality that can be computed from the assignment alone. Exact evaluation needs all pairwise distances and therefore quadratic time, which is impractical for large datasets, and prior fast methods offered no controllable error bounds. The new sampling techniques reduce distance computations to a quantity linear in n for fixed k and ε while guaranteeing additive error O(ε) with high probability. The same sampling framework is adapted to run in the MapReduce and MPC models using a constant number of rounds and sublinear local memory. Experiments indicate that the resulting estimates achieve a superior accuracy-efficiency trade-off compared with existing heuristics on both local per-point scores and the global silhouette value.

Core claim

For any metric k-clustering, the local silhouette of each point and the global silhouette of the clustering can be estimated to within additive error O(ε) by sampling O(nk ε^{-2} ln(nk/δ)) distances, with success probability at least 1-δ, for arbitrary ε, δ in (0,1). These are the first rigorous, controllable approximations; earlier fast methods were heuristics without such guarantees.
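To give a feel for the stated bound, the sketch below compares the number of pairwise distances in the exact computation with the quantity n·k·ε^{-2}·ln(nk/δ). The constant hidden in the O(·) is not stated here, so the code takes it to be 1 purely for illustration; the function names are hypothetical.

```python
import math

def exact_distance_count(n: int) -> int:
    """Number of unordered pairs, i.e. distances used by the exact computation."""
    return n * (n - 1) // 2

def sampled_distance_bound(n: int, k: int, eps: float, delta: float, c: float = 1.0) -> float:
    """Value of c * n * k * eps^-2 * ln(n*k/delta); c = 1 is an assumption."""
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie in (0, 1)")
    return c * n * k * math.log(n * k / delta) / eps ** 2

if __name__ == "__main__":
    k, eps, delta = 10, 0.1, 0.01
    for n in (10**4, 10**6, 10**8):
        print(n, exact_distance_count(n), round(sampled_distance_bound(n, k, eps, delta)))
```

With the constant taken as 1, k = 10, ε = 0.1 and δ = 0.01, the bound exceeds the exact pair count at n = 10^4 and falls below it at n = 10^6, which is consistent with the claim being about linear growth in n and not about small inputs.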

What carries the argument

Sampling to estimate the intra-cluster and nearest inter-cluster average distances that enter the silhouette formula for each point.
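As a minimal sketch of this idea, the code below estimates a(i) and b(i) from at most t reference points per cluster and plugs them into s = (b - a)/max(a, b). It assumes plain uniform sampling and Euclidean points, which is a simplification chosen for illustration and may differ from the sampling scheme the paper analyzes; `sampled_silhouette` and its parameters are hypothetical names.

```python
import math
import random

def sampled_silhouette(points, labels, t, seed=0):
    """Return (per-point estimates, their mean) using at most t uniformly
    sampled reference points per cluster (uniform sampling is an assumption)."""
    if t < 2:
        raise ValueError("t must be at least 2")
    rng = random.Random(seed)
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    # One shared sample (without replacement) per cluster.
    sample = {lab: rng.sample(members, min(t, len(members)))
              for lab, members in clusters.items()}

    estimates = []
    for i, p in enumerate(points):
        own = labels[i]
        if len(clusters[own]) == 1:
            estimates.append(0.0)  # common convention for singleton clusters
            continue
        avg = {}
        for lab, members in sample.items():
            # Skip the point itself so it does not bias its own cluster average.
            dists = [math.dist(p, points[j]) for j in members if j != i]
            avg[lab] = sum(dists) / len(dists)
        a = avg.pop(own)          # estimated intra-cluster average distance
        b = min(avg.values())     # estimated nearest other-cluster average
        top = max(a, b)
        estimates.append((b - a) / top if top > 0 else 0.0)
    return estimates, sum(estimates) / len(estimates)
```

Each point is compared with at most k·t sampled points, so the sketch performs at most n·k·t distance computations. When t is at least the size of the largest cluster, the samples are the full clusters and the result coincides with the exact silhouette.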

If this is right

Both per-point and aggregate silhouette values become computable on datasets too large for quadratic distance matrices.
The approximation works for arbitrary metric k-clusterings and requires no extra assumptions on data geometry.
Constant-round distributed implementations exist in MapReduce and MPC with sublinear memory per machine.
The accuracy-efficiency trade-off is controlled directly by the user-chosen parameters ε and δ.
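The constant-round structure mentioned in the list above can be pictured with a small single-process simulation. This is only a sketch under stated assumptions: uniform sampling via random keys, Euclidean points, and hypothetical function names; it is not the paper's MapReduce or MPC algorithm. Round 1 builds a per-cluster sample, round 2 lets every machine score its own points against that sample.

```python
import math
import random
from collections import defaultdict

def map_candidates(chunk, t, rng):
    """Round 1, map side: give each local point a random key and keep the
    t smallest keys per cluster. chunk holds (index, point, label) triples."""
    keyed = defaultdict(list)
    for idx, point, label in chunk:
        keyed[label].append((rng.random(), idx, point))
    return {label: sorted(items)[:t] for label, items in keyed.items()}

def reduce_sample(candidate_maps, t):
    """Round 1, reduce side: the t smallest keys overall form a uniform
    sample without replacement from each cluster."""
    merged = defaultdict(list)
    for cmap in candidate_maps:
        for label, items in cmap.items():
            merged[label].extend(items)
    return {label: [(idx, p) for _, idx, p in sorted(items)[:t]]
            for label, items in merged.items()}

def map_scores(chunk, sample):
    """Round 2: each machine scores only its own points against the
    broadcast sample, so it stores its chunk plus about k*t sampled points."""
    scores = {}
    for idx, point, label in chunk:
        avg = {}
        for lab, members in sample.items():
            dists = [math.dist(point, q) for j, q in members if j != idx]
            if dists:
                avg[lab] = sum(dists) / len(dists)
        a = avg.pop(label, None)
        if a is None or not avg:
            scores[idx] = 0.0  # singleton cluster or no other cluster available
        else:
            b = min(avg.values())
            scores[idx] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return scores

def distributed_silhouette(points, labels, machines=2, t=32, seed=0):
    """Simulate the two rounds in one process; returns {index: estimate}."""
    rng = random.Random(seed)
    triples = list(zip(range(len(points)), points, labels))
    chunks = [triples[m::machines] for m in range(machines)]
    sample = reduce_sample([map_candidates(c, t, rng) for c in chunks], t)
    scores = {}
    for chunk in chunks:
        scores.update(map_scores(chunk, sample))
    return scores
```

In this simulation a machine holds roughly n/m points plus the k·t sampled points in round 2, and the reducer in round 1 receives at most m·k·t candidates, which is the sense in which local memory can stay below n.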

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Silhouette-based model selection or validation could be inserted into clustering pipelines on web-scale or streaming data where it was previously ruled out by cost.
The same sampling idea may apply to other quality measures that depend only on average intra- and inter-cluster distances.
In practice the method could be combined with existing fast clustering routines to produce both a clustering and a certified quality score in near-linear total work.

Load-bearing premise

Concentration inequalities on the sampled average distances suffice to bound the error on the silhouette terms to O(ε) without further restrictions on the metric or the point distribution.

What would settle it

On any dataset small enough for exact silhouette computation, run the sampling estimator many times with chosen ε and δ and check whether the fraction of trials whose estimates deviate from the exact values by more than the claimed O(ε) bound exceeds δ.
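A test of this kind could be set up along the following lines. The sketch uses a uniform-sampling estimator as a stand-in (an assumption, not the paper's estimator), and the tolerance `tol` stands in for the O(ε) bound, whose constant is not stated here; all names are hypothetical.

```python
import math
import random

def silhouettes(points, labels, t=None, rng=None):
    """Per-point silhouettes. With t set, averages use at most t uniformly
    sampled members per cluster (t >= 2); with t=None they are exact."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if t is None:
        ref = clusters
    else:
        ref = {lab: rng.sample(m, min(t, len(m))) for lab, m in clusters.items()}
    scores = []
    for i, p in enumerate(points):
        avg = {}
        for lab, members in ref.items():
            others = [j for j in members if j != i]
            if others:
                avg[lab] = sum(math.dist(p, points[j]) for j in others) / len(others)
        a = avg.pop(labels[i], None)
        if a is None or not avg:
            scores.append(0.0)  # singleton cluster or undefined case
            continue
        b = min(avg.values())
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

def failure_rate(points, labels, t, tol, trials=50, seed=0):
    """Fraction of trials in which some point's estimate is off by more than tol."""
    rng = random.Random(seed)
    exact = silhouettes(points, labels)
    bad = 0
    for _ in range(trials):
        est = silhouettes(points, labels, t=t, rng=rng)
        if max(abs(x - y) for x, y in zip(est, exact)) > tol:
            bad += 1
    return bad / trials

if __name__ == "__main__":
    rng = random.Random(1)
    # Two synthetic Gaussian blobs, 100 points each (illustrative data only).
    points = [(rng.gauss(cx, 1.0), rng.gauss(0.0, 1.0))
              for cx in (0.0, 6.0) for _ in range(100)]
    labels = [0] * 100 + [1] * 100
    for t in (4, 16, 64):
        print(t, failure_rate(points, labels, t=t, tol=0.1))
```

The observed failure rate would then be compared with δ for the sample size that the analysis prescribes for the chosen ε and δ.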

Figures

Figures reproduced from arXiv: 2607.01993 by Andrea Pietracaprina, Fabio Vandin, Federico Altieri, Geppino Pucci, Ilie Sarpe.

**Figure 1.** (a): dataset $V = \{e_1, \dots, e_n\}$ for $n = 15$ with elements in the Euclidean plane $\mathbb{R}^2$. Different shapes represent $k = 4$ different clusters, that is $\mathcal{C} = \{C_1 = \{e_1, \dots, e_3, e_{14}\}, C_2 = \{e_4, \dots, e_{10}\}, C_3 = \{e_{11}, e_{12}, e_{13}\}, C_4 = \{e_{15}\}\}$. (b): silhouette $s(e)$ of the elements $e \in V$ (where the distance is the Euclidean distance): values are grouped by clusters and sorted. The dashed line represents the value o…

**Figure 2.** Instance used to prove Theorem 2 (see Section 4.1.3). $C_1 = \{e_1, \dots, e_{m+1}\}$ and $C_2 = \{e'_1, \dots, e'_{m+1}\}$.

**Figure 3.** Trade-off between accuracy and efficiency over medium-sized datasets (see …

**Figure 4.** Methods comparison. The $x$ axis (in log scale) is associated with the average runtime, and the $y$ axis with the average error (and its standard deviation) over 10 independent runs. For each clustered dataset we considered different values of the expected sample size $t \in \{32, 64, 128\}$.

**Figure 5.** Plots of the average maximum error and its standard deviation obtained on all elements …

**Figure 6.** Plots of the average and maximum errors within each cluster achieved by …

**Figure 7.** Average parallel speedup over five independent runs by running …

**Figure 8.** Silhouette plot construction. (Left): Silhouette plot obtained from the exact algorithm.

**Figure 9.** Plots of Cumulative accuracy$(\ell, \hat{k})$, where $\ell \in [1, 14]$ and $\hat{k}$ is either $k_{\mathrm{pps}}$ or $k_{\mathrm{uni}}$. For each value of $\ell$, the plots report the cumulative accuracy of each estimation method over 200 runs: values closer to 1 denote a higher accuracy for the identification of the best value of $k$.

**Figure 10.** Methods comparison. The $x$ axis is associated with the average runtime, and the $y$ axis with the average error (and its standard deviation) over 10 independent runs. For each clustered dataset we considered different values of the expected sample size $t \in \{32, 64, 128\}$, as illustrated.

**Figure 11.** Plots of the average maximum error and its standard deviation obtained on all elements …

**Figure 12.** For each cluster in $\mathcal{C}$ we show the average and maximum errors achieved by silh-pps-all (pps) and the uniform sampling based approach from Section 4.1.3 (uni). Each dot represents an independent run.

Original abstract

The silhouette is one of the most widely used measures to assess the quality of a $k$-clustering of a dataset of $n$ elements. Its evaluation requires no information beyond the clustering assignment. In addition, the silhouette is extremely easy to interpret, providing a score to measure the quality of a clustering as a whole or for each element. The exact computation of the: (i) silhouette of each element of a dataset; and (ii) the global silhouette of the clustering; require $\Theta(n^2)$ distance calculations, under general metrics. The quadratic complexity $\Theta(n^2)$ is extremely prohibitive, especially on massive modern datasets. Surprisingly, existing approximate methods using $o(n^2)$ distance calculations are heuristics not offering provable and controllable guarantees on the quality of their results. We introduce the first rigorous and efficient algorithms to estimate: (i) the (local) silhouette of each element of a dataset; and (ii) the (global) silhouette; of any metric $k$-clustering. Our methods, based on sampling, perform $O(nk\varepsilon^{-2}\ln (nk/\delta))$ distance computations, and provide estimates with additive error $O(\varepsilon)$ with probability at least $1-\delta$. That is, parameters $\varepsilon$ and $\delta$ in $(0,1)$ control the trade-off between accuracy and efficiency. We also introduce a scalable and distributed design of our methods for the MapReduce and Massively Parallel Computing (MPC) frameworks. Our distributed algorithms use a constant number of rounds and sublinear local memory. Finally, we perform extensive experiments against state-of-the-art approaches. The results show that our new techniques yield the best trade-off between accuracy and efficiency for both local and global silhouette estimation. In addition, our methods scale efficiently to massive datasets for which an exact computation of the silhouette is not practical.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Sampling gives the first provable additive-error bounds for silhouette, but the ratio in s(i) can amplify small additive errors on a(i) and b(i) when both are near zero.

The core advance is a sampling scheme that estimates per-point silhouettes and the global one with additive O(ε) error after O(nk ε^{-2} ln(nk/δ)) distance computations, plus constant-round MapReduce and MPC versions that use sublinear local memory.

It improves on the heuristics that previously existed by supplying explicit, controllable guarantees and by showing practical scaling on large instances where exact quadratic computation is impossible.

The soft spot is the error claim itself. Silhouette is (b-a)/max(a,b). Additive concentration on the averages a and b (via Hoeffding or similar) does not yield additive concentration on the ratio when max(a,b) can be arbitrarily small; the gradient diverges as both terms approach zero, so |â-a|≤ε and |b̂-b|≤ε can produce Ω(1) error on s even for tiny ε. The abstract states additive O(ε) on the silhouette values for arbitrary metrics without a minimum-distance assumption or a refined analysis, so that central guarantee needs checking.

The work is aimed at people who run clustering pipelines on big data and want a quality metric they can trust without quadratic cost. A reader focused on approximation algorithms or distributed clustering evaluation will find the sampling and communication bounds useful.

It deserves a serious referee to examine the proofs and see whether the error analysis closes the gap on the ratio.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces the first sampling-based algorithms for approximating the local silhouette s(i) of each point and the global silhouette of a metric k-clustering. The algorithms use O(nk ε^{-2} ln(nk/δ)) distance computations to obtain additive O(ε) error with probability ≥1-δ; they are also extended to constant-round MapReduce and MPC distributed models with sublinear local memory. Experiments demonstrate favorable accuracy-efficiency trade-offs versus prior heuristics on large datasets.

Significance. If the stated additive-error guarantees hold, the work supplies the first provably controllable approximation algorithms for silhouette computation, removing the Θ(n²) barrier that has limited its use on massive data. The explicit complexity bound, distributed implementations, and experimental validation constitute a concrete advance for scalable clustering evaluation in theoretical computer science.

major comments (1)

[Sections presenting the local-silhouette estimator and its error analysis (abstract claim and algorithm introduction)] The central claim of additive O(ε) error on the silhouette values s(i) = (b(i)-a(i))/max(a(i),b(i)) (and on the global average) rests on additive concentration bounds for the averages a(i) and b(i). Because the map (a,b)↦s is not Lipschitz with constant independent of a and b (its gradient diverges as a,b→0), |â-a|≤O(ε) and |b̂-b|≤O(ε) can produce |ŝ-s|=Ω(1) even for arbitrarily small ε. No case analysis for small a(i),b(i), lower-bound assumption, or refined (e.g., multiplicative) analysis appears to close this gap.
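The amplification described in this comment can be reproduced with a two-line calculation. The sketch below assumes that the only guarantee available is an absolute error of ε on each of a and b; it illustrates the referee's concern and says nothing about whether the paper's actual analysis has this gap. The function names and the parameter `scale` are hypothetical.

```python
def silhouette(a: float, b: float) -> float:
    """s = (b - a) / max(a, b), with s = 0 when both averages are 0."""
    m = max(a, b)
    return (b - a) / m if m > 0 else 0.0

def gap(scale: float, eps: float) -> float:
    """|s_hat - s| when a = scale, b = 2*scale and each estimate is off by eps."""
    a, b = scale, 2 * scale
    a_hat, b_hat = a + eps, b - eps  # both within eps of the true averages
    return abs(silhouette(a_hat, b_hat) - silhouette(a, b))

if __name__ == "__main__":
    for eps in (1e-1, 1e-3, 1e-6):
        # Averages of order eps versus averages of order 1.
        print(eps, gap(scale=eps, eps=eps), gap(scale=1.0, eps=eps))
```

When the averages are themselves of order ε (a = ε, b = 2ε), the true value is 0.5 and the perturbed value is -0.5, so the gap is 1 for every ε. When the averages are of order 1, the same perturbation changes s by 1.5ε/(2-ε), which is below ε.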

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and insightful review. The observation regarding error propagation through the silhouette function is a substantive technical point that we address below. We will revise the manuscript to strengthen the analysis.

Referee: [Sections presenting the local-silhouette estimator and its error analysis (abstract claim and algorithm introduction)] The central claim of additive O(ε) error on the silhouette values s(i) = (b(i)-a(i))/max(a(i),b(i)) (and on the global average) rests on additive concentration bounds for the averages a(i) and b(i). Because the map (a,b)↦s is not Lipschitz with constant independent of a and b (its gradient diverges as a,b→0), |â-a|≤O(ε) and |b̂-b|≤O(ε) can produce |ŝ-s|=Ω(1) even for arbitrarily small ε. No case analysis for small a(i),b(i), lower-bound assumption, or refined (e.g., multiplicative) analysis appears to close this gap.

Authors: We acknowledge that the referee correctly identifies a gap: the map (a,b) → s(a,b) is not uniformly Lipschitz, so additive O(ε) approximations to a(i) and b(i) do not automatically imply additive O(ε) error in s(i) when a(i) or b(i) are small. The current manuscript provides concentration bounds only for the estimators of a(i) and b(i) and does not contain an explicit case analysis, lower-bound assumption on a(i),b(i), or multiplicative refinement for the silhouette values. We will revise the sections on the local-silhouette estimator and its analysis (and the corresponding global-silhouette claims) to close this gap. The revision will introduce a case distinction: when max(a(i),b(i)) ≥ cε for a suitable constant c, the local Lipschitz constant is bounded and the additive error carries through; when max(a(i),b(i)) = O(ε), we will either (i) output a conservative estimate with an explicit larger error bound or (ii) state that the additive O(ε) guarantee on s(i) holds under the additional assumption that a(i),b(i) = Ω(ε). We believe this addresses the concern without altering the sampling complexity.

revision: yes
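One way to make the case distinction concrete is the following elementary bound, offered as a sketch and not as the authors' planned analysis. Write $M = \max(a,b)$ and $\hat{M} = \max(\hat{a},\hat{b})$, and assume $|\hat{a}-a| \le \eta$ and $|\hat{b}-b| \le \eta$ with $\eta < M$; the symbol $\eta$ is introduced here for illustration. Then $|\hat{M}-M| \le \eta$ and $|b-a| \le M$, so:

```latex
\begin{align*}
\hat{s} - s
  &= \frac{(\hat{b}-\hat{a}) - (b-a)}{\hat{M}}
     + (b-a)\,\frac{M-\hat{M}}{M\,\hat{M}}, \\
|\hat{s} - s|
  &\le \frac{2\eta}{\hat{M}} + \frac{|b-a|}{M}\cdot\frac{\eta}{\hat{M}}
   \le \frac{3\eta}{\hat{M}}
   \le \frac{3\eta}{M-\eta}.
\end{align*}
```

Under this bound, a relative guarantee $\eta \le \varepsilon M$ gives $|\hat{s}-s| \le 3\varepsilon/(1-\varepsilon)$, whereas an absolute error $\eta = \varepsilon$ with $M \ge c\varepsilon$ and $c > 1$ gives at most $3/(c-1)$.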

Circularity Check

0 steps flagged

No circularity; sampling-based bounds derived independently

The paper presents sampling algorithms whose error guarantees rest on standard concentration inequalities applied directly to the definitional averages a(i) and b(i) that enter the silhouette formula. No step reduces the claimed O(ε) additive guarantee on the silhouette values to a fitted parameter, a self-citation chain, or a redefinition of the target quantity; the sampling complexity and probability bounds are obtained from first-principles tail bounds that do not presuppose the final result. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review prevents exhaustive identification; the sampling approach implicitly relies on standard probabilistic concentration results.

axioms (1)

standard math · Standard concentration inequalities (e.g., Hoeffding) apply to the per-point intra- and nearest-cluster average distance estimators used in silhouette.
The stated O(ε) additive error with 1-δ probability is the typical output of such bounds applied to sampling.
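For reference, the textbook Hoeffding argument that produces a bound of this shape runs as follows for uniform sampling with replacement. The symbols $D$ (an upper bound on all distances), $t$ (the number of samples per average) and $\mu$ (the true average) are introduced for this sketch, and the resulting error is relative to $D$, which may differ from the guarantee the paper proves.

```latex
\begin{align*}
\Pr\!\left[\,\left|\frac{1}{t}\sum_{j=1}^{t} X_j - \mu\right| \ge \varepsilon D\,\right]
  &\le 2\exp\!\left(-2t\varepsilon^{2}\right),
  \qquad X_j \in [0, D],\quad \mathbb{E}[X_j] = \mu, \\
t \ge \frac{1}{2\varepsilon^{2}}\ln\frac{2nk}{\delta}
  &\;\Longrightarrow\;
  2\exp\!\left(-2t\varepsilon^{2}\right) \le \frac{\delta}{nk}.
\end{align*}
```

A union bound over the $nk$ (point, cluster) averages then makes all of them accurate simultaneously with probability at least $1-\delta$, for a total of $nk \cdot t = O(nk\,\varepsilon^{-2}\ln(nk/\delta))$ distance computations.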


[26] [26]

Mining of Massive Datasets , year =

Jure Leskovec and Anand Rajaraman and Jeffrey David Ullman , publisher =. Mining of Massive Datasets , year =

[27] [27]

Lloyd , journal =

S. Lloyd , journal =. Least squares quantization in. 1982 , month =. doi:10.1109/tit.1982.1056489 , publisher =

work page doi:10.1109/tit.1982.1056489 1982

[28] [28]

Kusner and Wenlin Chen and Kilian Q

Gustavo Malkomes and Matt J. Kusner and Wenlin Chen and Kilian Q. Weinberger and Benjamin Moseley , booktitle =. Fast Distributed k-Center Clustering with Outliers on Massive Data , year =

[29] [29]

Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces , year =

Alessio Mazzetto and Andrea Pietracaprina and Geppino Pucci , booktitle =. Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces , year =

[30] [30]

Jaskowiak and Ricardo J

Davoud Moulavi and Pablo A. Jaskowiak and Ricardo J. G. B. Campello and Arthur Zimek and J. Density-Based Clustering Validation , year =. Proceedings of the 2014

2014

[31] [31]

Ng and Jiawei Han , booktitle =

Raymond T. Ng and Jiawei Han , booktitle =. Efficient and Effective Clustering Methods for Spatial Data Mining , year =

[32] [32]

Space-round tradeoffs for MapReduce computations , year =

Andrea Pietracaprina and Geppino Pucci and Matteo Riondato and Francesco Silvestri and Eli Upfal , booktitle =. Space-round tradeoffs for MapReduce computations , year =

[33] [33]

Rousseeuw , journal =

Peter J. Rousseeuw , journal =. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis , year =. doi:10.1016/0377-0427(87)90125-7 , publisher =

work page doi:10.1016/0377-0427(87)90125-7

[34] [34]

Kersten , journal =

Thibault Sellam and Robin Cijvat and Richard Koopmanschap and Martin L. Kersten , journal =. Blaeu: Mapping and Navigating Large Tables with Cluster Analysis , year =. doi:10.14778/3007263.3007288 , url =

work page doi:10.14778/3007263.3007288

[35] [35]

2005 , address =

Pang-Ning Tan and Michael Steinbach and Vipin Kumar , publisher =. 2005 , address =

2005

[36] [36]

Aggarwal and Yelong Shen , booktitle =

Lin Liu and Ruoming Jin and Charu C. Aggarwal and Yelong Shen , booktitle =. Reliable Clustering on Uncertain Graphs , year =

[37] [37]

Emmendorfer and Eduardo Nunes Borges and Karina S

Caroline Tomasini and Leonardo R. Emmendorfer and Eduardo Nunes Borges and Karina S. Machado , booktitle =. A methodology for selecting the most suitable cluster validation internal indices , year =

[38] [38]

An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity , year =

Fei Wang and Hector. An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity , year =. Machine Learning and Data Mining in Pattern Recognition - 13th International Conference,. doi:10.1007/978-3-319-62416-7\_21 , url =

work page doi:10.1007/978-3-319-62416-7

[39] [39]

Comparing the performance of biomedical clustering methods , year =

Christian Wiwie and Jan Baumbach and Richard Röttger , journal =. Comparing the performance of biomedical clustering methods , year =. doi:10.1038/nmeth.3583 , publisher =

work page doi:10.1038/nmeth.3583

[40] [40]

Clustering Validation Measures , year =

Hui Xiong and Zhongmou Li , booktitle =. Clustering Validation Measures , year =

[41] [41]

2009 , month = jul, abstract =

Andreas Maurer and Massimiliano Pontil , title =. 2009 , month = jul, abstract =

2009

[42] [42]

Analytica Chimica Acta , title =

Llet. Analytica Chimica Acta , title =. 2004 , issn =. doi:10.1016/j.aca.2003.12.020 , publisher =

work page doi:10.1016/j.aca.2003.12.020 2004

[43] [43]

Silhouette Index as Clustering Evaluation Tool , year =

Dudek, Andrzej , pages =. Silhouette Index as Clustering Evaluation Tool , year =. Classification and Data Analysis , doi =

[44] [44]

Probability and Computing , year =

Michael Mitzenmacher and Eli Upfal , publisher =. Probability and Computing , year =

[45] [45]

, journal =

Schubert, Erich and Rousseeuw, Peter J. , journal =. Fast and eager k -medoids clustering:. 2021 , issn =. doi:10.1016/j.is.2021.101804 , publisher =

work page doi:10.1016/j.is.2021.101804 2021

[46] [46]

A new partitioning around medoids algorithm , year =

Van der Laan, Mark and Pollard, Katherine and Bryan, Jennifer , journal =. A new partitioning around medoids algorithm , year =. doi:10.1080/0094965031000136012 , publisher =

work page doi:10.1080/0094965031000136012

[47] [47]

Medoid Silhouette clustering with automatic cluster number selection , year =

Lenssen, Lars and Schubert, Erich , journal =. Medoid Silhouette clustering with automatic cluster number selection , year =. doi:10.1016/j.is.2023.102290 , publisher =

work page doi:10.1016/j.is.2023.102290 2023

[48] [48]

and Perona, Iñigo , journal =

Arbelaitz, Olatz and Gurrutxaga, Ibai and Muguerza, Javier and Pérez, Jesús M. and Perona, Iñigo , journal =. An extensive comparative study of cluster validity indices , year =. doi:10.1016/j.patcog.2012.07.021 , publisher =

work page doi:10.1016/j.patcog.2012.07.021 2012

[49] [49]

Clustering by Means of Medoids , year =

Kaufmann, Leonard and Rousseeuw, Peter , journal =. Clustering by Means of Medoids , year =

[50] [50]

Silhouette coefficient-based weighting k-means algorithm , year =

Lai, Huixia and Huang, Tao and Lu, BinLong and Zhang, Shi and Xiaog, Ruliang , journal =. Silhouette coefficient-based weighting k-means algorithm , year =. doi:10.1007/s00521-024-10706-0 , publisher =

work page doi:10.1007/s00521-024-10706-0

[51] [51]

Cluster Quality Analysis Using Silhouette Score , year =

Shahapure, Ketan Rajshekhar and Nicholas, Charles , booktitle =. Cluster Quality Analysis Using Silhouette Score , year =

[52] [52]

Scalable Distributed Approximation of Internal Measures for Clustering Evaluation , year =

Altieri, Federico and Pietracaprina, Andrea and Pucci, Geppino and Vandin, Fabio , pages =. Scalable Distributed Approximation of Internal Measures for Clustering Evaluation , year =. Proceedings of the 2021 SIAM International Conference on Data Mining (SDM) , doi =

2021

[53] [53]

A Comprehensive Survey of Clustering Algorithms , year =

Xu, Dongkuan and Tian, Yingjie , journal =. A Comprehensive Survey of Clustering Algorithms , year =. doi:10.1007/s40745-015-0040-1 , publisher =

work page doi:10.1007/s40745-015-0040-1

[54] [54]

and Ezugwu, Absalom E

Ikotun, Abiodun M. and Ezugwu, Absalom E. and Abualigah, Laith and Abuhaija, Belal and Heming, Jia , journal =. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data , year =. doi:10.1016/j.ins.2022.11.139 , publisher =

work page doi:10.1016/j.ins.2022.11.139 2022

[55] [55]

A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , year =

Karypis, George and Kumar, Vipin , journal =. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , year =. doi:10.1137/s1064827595287997 , publisher =

work page doi:10.1137/s1064827595287997

[56] [56]

Advances in neural information processing systems , volume=

An impossibility theorem for clustering , author=. Advances in neural information processing systems , volume=

[57] [57]

Stop using the elbow criterion for k-means and how to choose the number of clusters instead , year =

Schubert, Erich , journal =. Stop using the elbow criterion for k-means and how to choose the number of clusters instead , year =. doi:10.1145/3606274.3606278 , publisher =

work page doi:10.1145/3606274.3606278

[58] [58]

Understanding of Internal Clustering Validation Measures , year =

Liu, Yanchi and Li, Zhongmou and Xiong, Hui and Gao, Xuedong and Wu, Junjie , booktitle =. Understanding of Internal Clustering Validation Measures , year =

[59] [59]

and Tayfor, Noor Bahjat and Hassan, Alla A

Hassan, Bryar A. and Tayfor, Noor Bahjat and Hassan, Alla A. and Ahmed, Aram M. and Rashid, Tarik A. and Abdalla, Naz N. , journal =. From A-to-Z review of clustering validation indices , year =. doi:10.1016/j.neucom.2024.128198 , publisher =

work page doi:10.1016/j.neucom.2024.128198 2024

[60] [60]

Silhouette scores for assessment of SNP genotype clusters , year =

Lovmar, Lovisa and Ahlford, Annika and Jonsson, Mats and Syvänen, Ann-Christine , journal =. Silhouette scores for assessment of SNP genotype clusters , year =. doi:10.1186/1471-2164-6-35 , publisher =

work page doi:10.1186/1471-2164-6-35

[61] [61]

Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient , year =

Dinh, Duy-Tai and Fujinami, Tsutomu and Huynh, Van-Nam , pages =. Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient , year =. Knowledge and Systems Sciences , doi =

[62] [62]

2023 , volume =

Im, Sungjin and Kumar, Ravi and Lattanzi, Silvio and Moseley, Benjamin and Vassilvitskii, Sergei , title =. 2023 , volume =

2023