Clustering is difficult only when it does not matter

Clustering is difficult only when it does not matter , author= · 2012 · cs.LG · arXiv 1205.4891

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets {\em that can be clustered well}. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioners' perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care for data sets that can be clustered well. We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of "good clustering". We show that if a good clustering exists, then in many cases it can be efficiently found. Our conclusion is that contrary to popular belief, clustering should not be considered a hard task.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Analysis of Ward's Method

cs.DS · 2019-07-11 · unverdicted · novelty 7.0

Ward's method yields a 2-approximation for k-means under well-separated optima (recovering the optimum under balance) in any dimension, with Ω((3/2)^d) lower bounds without separation and O(1) in 1D.

Optimal Phylogenetic Reconstruction from Sampled Quartets

cs.DS · 2026-04-19 · unverdicted · novelty 7.0

An efficient algorithm recovers phylogenetic trees from Θ(n) noisy quartets under random classification noise, matching the information-theoretic lower bound and achieving near-optimal quartet distance.

Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch

cs.DS · 2026-05-05 · unverdicted · novelty 6.0

Triplet constraints realizable in D-dimensional Euclidean space cannot be preserved above 50% accuracy by any embedding of dimension at most cD for constant c<1, with UGC-hardness preventing better polynomial-time solutions in any dimension.

citing papers explorer

Showing 3 of 3 citing papers.

Analysis of Ward's Method cs.DS · 2019-07-11 · unverdicted · none · ref 18 · internal anchor
Ward's method yields a 2-approximation for k-means under well-separated optima (recovering the optimum under balance) in any dimension, with Ω((3/2)^d) lower bounds without separation and O(1) in 1D.
Optimal Phylogenetic Reconstruction from Sampled Quartets cs.DS · 2026-04-19 · unverdicted · none · ref 191
An efficient algorithm recovers phylogenetic trees from Θ(n) noisy quartets under random classification noise, matching the information-theoretic lower bound and achieving near-optimal quartet distance.
Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch cs.DS · 2026-05-05 · unverdicted · none · ref 191
Triplet constraints realizable in D-dimensional Euclidean space cannot be preserved above 50% accuracy by any embedding of dimension at most cD for constant c<1, with UGC-hardness preventing better polynomial-time solutions in any dimension.

Clustering is difficult only when it does not matter

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer