Distributionally Robust K-Means Clustering
Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3
The pith
K-means clustering is made robust to outliers and distribution shifts by minimizing the worst-case expected squared distance over distributions in a Wasserstein-2 ball around the empirical data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Viewing k-means as Lloyd–Max quantization of the empirical distribution, the paper posits that the unknown population distribution lies inside a Wasserstein-2 ball centered at that empirical measure. Cluster centers are chosen to minimize the worst-case expected squared distance to the data over all measures inside this ball, producing a minimax problem. The dual of this problem is tractable and yields a soft-clustering scheme in which each point receives a smoothly weighted assignment to the centers rather than a hard label. A block coordinate descent algorithm solves the resulting problem, decreasing the objective monotonically at each iteration and converging locally at a linear rate.
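A plausible rendering of this minimax program, in our notation rather than the paper's (empirical measure \hat{P}_n over the observed points, ball radius \rho, centers c_1, ..., c_k):

```latex
% Sketch of the DRO k-means objective (our notation; the paper's may differ).
% \hat{P}_n : empirical measure of the observed points
% \rho      : radius of the Wasserstein-2 ambiguity ball (free parameter)
\min_{c_1,\dots,c_k \in \mathbb{R}^d}\;
\sup_{Q :\, W_2(Q, \hat{P}_n) \le \rho}\;
\mathbb{E}_{x \sim Q}\!\left[\min_{1 \le j \le k} \|x - c_j\|^2\right]
```

At \rho = 0 the supremum collapses to the empirical average and ordinary k-means is recovered.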
What carries the argument
The minimax objective over the Wasserstein-2 ambiguity set around the empirical distribution, whose dual reformulation produces the soft weighted assignments.
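The paper's exact dual weights are not reproduced here, but a minimal sketch of a block-coordinate scheme with smooth softmax assignments, used as an illustrative surrogate for those dual weights, shows the shape of the resulting algorithm (the inverse temperature beta is our stand-in for the role the Wasserstein radius plays in the true dual):

```python
import numpy as np

def soft_kmeans_bcd(X, k, beta=5.0, iters=100, seed=0):
    """Block-coordinate soft k-means sketch. The softmax weights are an
    illustrative surrogate for the paper's dual weights, not its actual dual."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]  # init centers from data
    for _ in range(iters):
        # Block 1: soft assignment weights given centers (rows sum to 1).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)       # (n, k)
        W = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))  # stabilized
        W /= W.sum(axis=1, keepdims=True)
        # Block 2: centers given weights (weighted means of the data).
        C = (W.T @ X) / W.sum(axis=0)[:, None]
    return C, W
```

In this surrogate each block update exactly minimizes an entropy-smoothed objective over one block, which is the same mechanism behind the monotonic-decrease claim for the paper's dualized loss.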
If this is right
- The block coordinate descent algorithm decreases the objective at every iteration.
- The same algorithm converges locally at a linear rate once near a stationary point.
- The soft weights improve outlier detection compared with hard assignments on standard benchmarks.
- Performance remains stable under added noise on large synthetic data sets.
Where Pith is reading between the lines
- The same Wasserstein-ball construction could be applied to other prototype-based methods such as Gaussian mixture models by replacing the squared-distance loss with the appropriate log-likelihood term.
- Because the dual weights are smooth, they might serve directly as soft labels for downstream semi-supervised tasks without an extra post-processing step.
- The radius of the Wasserstein ball acts as a single tunable robustness parameter that could be chosen by cross-validation on a small validation set; a sketch of such a selection loop follows this list.
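A minimal sketch of that cross-validation loop, assuming a hypothetical fit_robust_centers(X, rho) that runs the paper's method at radius rho and returns the centers (not a real library function):

```python
import numpy as np

def select_radius(X_train, X_val, radii, fit_robust_centers):
    """Pick the Wasserstein radius with the lowest held-out clustering cost.
    fit_robust_centers(X, rho) is assumed, hypothetically, to implement the
    paper's robust k-means at radius rho; swap in any real implementation."""
    def heldout_cost(C, X):
        # Mean squared distance from each point to its nearest center.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return d2.min(axis=1).mean()
    scores = [heldout_cost(fit_robust_centers(X_train, rho), X_val)
              for rho in radii]
    return radii[int(np.argmin(scores))]
```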
Load-bearing premise
The true population distribution stays inside the chosen Wasserstein-2 ball around the observed data points, and the worst-case expectation over that ball admits an explicit dual that can be optimized.
What would settle it
On a collection of points drawn from a distribution known to lie well outside the Wasserstein-2 ball used in training, the robust centers should show no improvement over ordinary k-means in held-out squared-error or outlier-masking metrics.
Original abstract
K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd–Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a distributionally robust variant of K-means by minimizing the worst-case expected squared distance over a Wasserstein-2 ball around the empirical distribution. A tractable dual reformulates the problem as soft clustering with smoothly weighted assignments to centers. An efficient block coordinate descent algorithm is proposed with claimed monotonic decrease and local linear convergence, supported by experiments showing gains in outlier robustness on benchmarks and synthetic data.
Significance. If the dual derivation and convergence analysis hold, the work supplies a principled, computationally tractable way to harden K-means against outliers and distribution shift, a frequent practical failure mode. The soft-assignment reformulation and BCD scheme with monotonicity guarantee are attractive for implementation; the experiments provide supporting evidence of practical benefit.
major comments (1)
- Convergence analysis (BCD algorithm): The local linear convergence claim rests on smoothness and strong-convexity properties of the dualized objective that are not automatically preserved when soft weights approach zero for some clusters (possible for moderate-to-large Wasserstein radii or imbalanced data). The manuscript does not state an explicit non-degeneracy condition or show that the algorithm maintains the required Hessian positive-definiteness or Jacobian contraction in this regime.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. The referee's observation on the convergence analysis is well-taken, and we will revise the manuscript to address it explicitly.
Point-by-point responses
- Referee: Convergence analysis (BCD algorithm): The local linear convergence claim rests on smoothness and strong-convexity properties of the dualized objective that are not automatically preserved when soft weights approach zero for some clusters (possible for moderate-to-large Wasserstein radii or imbalanced data). The manuscript does not state an explicit non-degeneracy condition or show that the algorithm maintains the required Hessian positive-definiteness or Jacobian contraction in this regime.
Authors: We appreciate this observation. The local linear convergence result (Theorem 4.2) relies on the dual objective satisfying uniform smoothness and strong convexity in a neighborhood of the iterates, which in turn requires the soft weights to be bounded away from zero. This holds under the radii used in our experiments and when clusters are reasonably separated, but the manuscript does not state an explicit non-degeneracy assumption. In the revision we will add a remark to the theorem statement clarifying this condition, together with a brief discussion of regimes where the weights remain positive (e.g., when the Wasserstein radius is not large relative to inter-cluster separation). We will also confirm that the weights stay bounded away from zero in the reported experiments. This clarification strengthens the presentation while preserving the main claims and algorithm. Revision: yes.
Circularity Check
No significant circularity; standard DRO dualization and new algorithm
Full rationale
The derivation begins from an external modeling choice (a Wasserstein-2 ball around the empirical measure) and proceeds via standard minimax duality to a soft-clustering objective; a newly proposed block coordinate descent procedure then solves it, with monotonicity and local linear convergence asserted via direct analysis of the resulting non-smooth but convex program. No step equates a claimed prediction or uniqueness result to a fitted parameter or prior self-citation by construction: the radius remains a free hyperparameter, the dual is obtained from first-principles Lagrangian arguments, and the convergence claims rest on properties of the dualized loss rather than on tautological renaming or self-referential fitting. The reasoning is therefore non-circular and can be checked against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Wasserstein ball radius
axioms (1)
- (standard math) The Wasserstein-2 distance defines a valid metric on probability measures with finite second moments; the definition is recalled below.
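For reference, the standard definition being invoked:

```latex
% Wasserstein-2 distance between probability measures P and Q with finite
% second moments; \Pi(P, Q) is the set of couplings of P and Q, i.e. joint
% measures whose marginals are P and Q.
W_2(P, Q) = \left( \inf_{\pi \in \Pi(P, Q)}
    \int \|x - y\|^2 \,\mathrm{d}\pi(x, y) \right)^{1/2}
```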
Reference graph
Works this paper leans on
- [1] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Oakland, CA, USA, 1967, pp. 281–297.
- [2] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07). USA: Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
- [3] A. Georgogiannis, "Robust k-means: a theoretical revisit," in Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS '16), Barcelona, Spain. Red Hook, NY, USA: Curran Associates Inc., 2016, pp. 2891–2899.
- [4] H. Li, S. Sugasawa, and S. Katayama, "Adaptively robust and sparse k-means clustering," Transactions on Machine Learning Research. [Online]. Available: https://openreview.net/forum?id=EhC84fT2yA
- [5] O. Dorabiala, J. N. Kutz, and A. Y. Aravkin, "Robust trimmed k-means," Pattern Recognition Letters, vol. 161, pp. 9–16, Sep. 2022. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0167865522002185
- [6] A. Bhaskara, S. Vadgama, and H. Xu, "Greedy sampling for approximate clustering in the presence of outliers," Advances in Neural Information Processing Systems, vol. 32, 2019.
- [7] A. Deshpande and R. Pratap, "Improved outlier robust seeding for k-means," arXiv:2309.02710 [cs], Sep. 2023. [Online]. Available: http://arxiv.org/abs/2309.02710
- [8] D. Pollard, "Strong consistency of K-means clustering," The Annals of Statistics, vol. 9, no. 1, pp. 135–140, 1981. [Online]. Available: https://doi.org/10.1214/aos/1176345339
- [9] C. Levrard, "Fast rates for empirical vector quantization," Electronic Journal of Statistics, vol. 7, pp. 1716–1746, 2013. [Online]. Available: https://doi.org/10.1214/13-EJS822
- [10] C. Levrard, "Non-asymptotic bounds for vector quantization in Hilbert spaces," The Annals of Statistics, vol. 43, no. 2, pp. 592–619, 2015. [Online]. Available: https://doi.org/10.1214/14-AOS1293
- [11] Y. Klochkov, A. Kroshnin, and N. Zhivotovskiy, "Robust k-means clustering for distributions with two moments," The Annals of Statistics, vol. 49, no. 4, pp. 2206–2230, 2021. [Online]. Available: https://doi.org/10.1214/20-AOS2033
- [12] K. Honda, A. Notsu, and H. Ichihashi, "Fuzzy PCA-guided robust k-means clustering," IEEE Transactions on Fuzzy Systems, vol. 18, no. 1, pp. 67–79, 2010.
- [13] A. Zelig, H. Kariti, and N. Kaplan, "KMD clustering: robust general-purpose clustering of biological data," Communications Biology, vol. 6, no. 1, p. 1110, Nov. 2023. [Online]. Available: https://www.nature.com/articles/s42003-023-05480-z
- [14] A. Coates and A. Y. Ng, "Learning feature representations with K-means," in Neural Networks: Tricks of the Trade, G. Montavon, G. B. Orr, and K.-R. Müller, Eds., Lecture Notes in Computer Science, vol. 7700. Berlin, Heidelberg: Springer, 2012, pp. 561–580. [Online]. Available: http://link.springer.com/10.1007/978-3-642-35289-8_30
- [15] T. Liu, Y. Lu, B. Zhu, and H. Zhao, "Clustering high-dimensional data via feature selection," Biometrics, vol. 79, no. 2, pp. 940–950, Jun. 2023.
- [16] L. A. García-Escudero, A. Gordaliza, C. Matrán, and A. Mayo-Iscar, "A review of robust clustering methods," Advances in Data Analysis and Classification, vol. 4, no. 2-3, pp. 89–109, Sep. 2010. [Online]. Available: http://link.springer.com/10.1007/s11634-010-0064-5
- [17] A. Deshpande, P. Kacham, and R. Pratap, "Robust k-means++," in Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), Proceedings of Machine Learning Research, vol. 124. PMLR, Aug. 2020, pp. 799–808. [Online]. Available: https://proceedings.mlr.press/v124/deshpande20a.html
- [18] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.
- [19] J. Max, "Quantizing for minimum distortion," IRE Transactions on Information Theory, vol. 6, no. 1, pp. 7–12, Mar. 1960.
- [20] M. Garey, D. Johnson, and H. Witsenhausen, "The complexity of the generalized Lloyd–Max problem (corresp.)," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 255–256, Mar. 1982.
- [21] C. Villani, Optimal Transport: Old and New, Grundlehren der mathematischen Wissenschaften, no. 338. Berlin: Springer, 2009.
- [22] T. Kargin, J. Hajar, V. Malik, and B. Hassibi, "Wasserstein distributionally robust regret-optimal control in the infinite-horizon," 2023.
- [23] S. Shafieezadeh Abadeh, V. A. Nguyen, D. Kuhn, and P. M. Mohajerin Esfahani, "Wasserstein distributionally robust Kalman filtering," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [24] D. Kuhn, P. M. Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh, "Wasserstein distributionally robust optimization: Theory and applications in machine learning," in Operations Research & Management Science in the Age of Analytics. INFORMS, 2019, pp. 130–166.
- [25] V. Malik, T. Kargin, V. Kostina, and B. Hassibi, "A distributionally robust approach to Shannon limits using the Wasserstein distance," arXiv preprint arXiv:2405.06528, 2024.
- [26] N. Fournier and A. Guillin, "On the rate of convergence in Wasserstein distance of the empirical measure," Probability Theory and Related Fields, vol. 162, no. 3-4, pp. 707–738, Aug. 2015. [Online]. Available: https://link.springer.com/10.1007/s00440-014-0583-7
- [27] R. Gao and A. Kleywegt, "Distributionally robust stochastic optimization with Wasserstein distance," Mathematics of Operations Research, vol. 48, no. 2, pp. 603–655, 2023.
- [28] M. Kelly, R. Longjohn, and K. Nottingham, "The UCI Machine Learning Repository." [Online]. Available: https://archive.ics.uci.edu
- [29] M. Sion, "On general minimax theorems," Pacific Journal of Mathematics, vol. 8, no. 1, pp. 171–176, 1958.
- [30] Y. Xu and W. Yin, "A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion," SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, 2013.
- [31] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
- [32] J. F. Bonnans and A. Shapiro, Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.
discussion (0)