pith. machine review for the scientific record.

arxiv: 2604.11118 · v1 · submitted 2026-04-13 · 💻 cs.LG · stat.ML

Recognition: unknown

Distributionally Robust K-Means Clustering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3

classification 💻 cs.LG · stat.ML

keywords distributionally robust optimization · k-means clustering · Wasserstein distance · soft clustering · block coordinate descent · robustness to outliers · unsupervised learning

The pith

K-means clustering is made robust to outliers and shifts by minimizing the worst-case squared distance over distributions in a Wasserstein-2 ball around the empirical data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a distributionally robust version of k-means by viewing it as quantization of the observed data points and requiring the centers to perform well even for the worst distribution inside a ball of possible perturbations. A sympathetic reader would care because ordinary k-means assignments break down when a few noisy points appear or the underlying population shifts modestly, and the new formulation supplies a concrete way to protect against those failures while keeping computation feasible. The key step is showing that the resulting minimax problem has a tractable dual whose solution replaces the usual hard cluster assignments with smooth, distance-weighted probabilities. An efficient block coordinate descent procedure is then derived that is guaranteed to decrease the objective at every step and to converge locally at a linear rate. Experiments indicate that the resulting centers detect outliers more reliably and maintain accuracy under added noise on both standard datasets and large synthetic examples.

Core claim

Viewing k-means as Lloyd-Max quantization of the empirical distribution, we posit that the unknown population distribution lies inside a Wasserstein-2 ball centered at that empirical measure. Cluster centers are chosen to minimize the worst-case expected squared distance to the data over all measures inside this ball, producing a minimax problem. The dual of this problem is tractable and yields a soft-clustering scheme in which each point receives a smoothly weighted assignment to the centers rather than a hard label. A block coordinate descent algorithm solves the resulting problem with a monotonic decrease at each iteration and local linear convergence.
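In notation of our own (the paper's symbols may differ): writing the empirical measure as \hat{P}_n, the ball radius as \rho, and the centers as c_1, \dots, c_k, the minimax problem described above can be sketched as

    \min_{c_1,\dots,c_k \in \mathbb{R}^d} \; \max_{Q \,:\, W_2(Q,\hat{P}_n) \le \rho} \; \mathbb{E}_{x \sim Q}\!\left[ \min_{1 \le j \le k} \| x - c_j \|^2 \right],
    \qquad \hat{P}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}.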

What carries the argument

The minimax objective over the Wasserstein-2 ambiguity set around the empirical distribution, whose dual reformulation produces the soft weighted assignments.
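The paper's dual-derived weights are not reproduced here. As an illustrative stand-in only, the sketch below runs an entropy-regularized soft k-means by block coordinate descent: softmax-style weights play the role of the smooth assignments, the temperature beta stands in for the dual variable tied to the Wasserstein radius (both are assumptions, not the paper's formulas), and a monotonicity check mirrors the guaranteed decrease claimed for the paper's algorithm.

    import numpy as np

    def soft_kmeans_bcd(X, k, beta=5.0, iters=100, seed=0):
        """Illustrative soft-assignment block coordinate descent on an (n, d) array.

        The softmax weights and the entropy-regularized surrogate objective are
        stand-ins for the paper's dual-derived quantities; `beta` plays the role
        of the dual variable tied to the Wasserstein radius (an assumption).
        """
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        prev_obj = np.inf
        for _ in range(iters):
            # Block 1: soft assignments from squared distances (softmax over clusters).
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, k)
            w = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
            w /= w.sum(axis=1, keepdims=True)
            # Surrogate objective: weighted distortion plus (1/beta) * entropy term.
            obj = (w * d2).sum() + (w * np.log(w + 1e-12)).sum() / beta
            assert obj <= prev_obj + 1e-8 * max(1.0, abs(prev_obj)), "surrogate increased"
            prev_obj = obj
            # Block 2: weighted mean update of the centers.
            centers = (w.T @ X) / (w.sum(axis=0)[:, None] + 1e-12)
        return centers, w

As a sanity check on the sketch itself, letting beta grow collapses the weights to hard assignments and the loop reduces to ordinary Lloyd iterations.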

If this is right

  • The block coordinate descent algorithm decreases the objective at every iteration.
  • The same algorithm converges locally at a linear rate once near a stationary point.
  • The soft weights improve outlier detection compared with hard assignments on standard benchmarks.
  • Performance remains stable under added noise on large synthetic data sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same Wasserstein-ball construction could be applied to other prototype-based methods such as Gaussian mixture models by replacing the squared-distance loss with the appropriate log-likelihood term.
  • Because the dual weights are smooth, they might serve directly as soft labels for downstream semi-supervised tasks without an extra post-processing step.
  • The radius of the Wasserstein ball acts as a single tunable robustness parameter that could be chosen by cross-validation on a small validation set, as sketched below.
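A minimal sketch of that selection loop, assuming a routine fit_robust_kmeans(X_train, k, radius) with this hypothetical signature (standing in for the paper's solver, which is not reproduced here):

    import numpy as np

    def pick_radius_by_validation(X, fit_robust_kmeans, radii, k, val_frac=0.2, seed=0):
        """Grid-search the Wasserstein-ball radius against held-out squared error.

        `fit_robust_kmeans(X_train, k, radius)` is a hypothetical callable standing
        in for the paper's algorithm; it must return an array of cluster centers.
        """
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        n_val = int(val_frac * len(X))
        X_val, X_train = X[idx[:n_val]], X[idx[n_val:]]

        def heldout_sse(centers):
            # Mean squared distance from each validation point to its nearest center.
            d2 = ((X_val[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            return d2.min(axis=1).mean()

        scores = {r: heldout_sse(fit_robust_kmeans(X_train, k, r)) for r in radii}
        best = min(scores, key=scores.get)
        return best, scores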

Load-bearing premise

The true population distribution stays inside the chosen Wasserstein-2 ball around the observed data points, and the worst-case expectation over that ball admits an explicit dual that can be optimized.
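For reference, the Wasserstein-2 distance in this premise is the standard optimal-transport metric (standard math; the notation below is ours, not the paper's):

    W_2(P, Q)^2 \;=\; \inf_{\pi \in \Pi(P, Q)} \int \| x - y \|^2 \, \mathrm{d}\pi(x, y),
    \qquad \text{premise: } W_2\!\left( P_{\mathrm{true}}, \hat{P}_n \right) \le \rho,

where \Pi(P, Q) denotes the set of couplings with marginals P and Q.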

What would settle it

On a collection of points drawn from a distribution known to lie well outside the Wasserstein-2 ball used in training, the robust centers should show no improvement over ordinary k-means in held-out squared-error or outlier-masking metrics.
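A hedged sketch of such a check, reusing the hypothetical fit_robust_kmeans signature from above and scikit-learn's ordinary k-means as the baseline; the held-out array X_far is assumed to be drawn from a population well outside the training ball:

    import numpy as np
    from sklearn.cluster import KMeans

    def out_of_ball_comparison(X_train, X_far, k, radius, fit_robust_kmeans, seed=0):
        """Compare held-out squared error of robust and plain k-means centers on
        data assumed to come from far outside the training Wasserstein-2 ball."""
        robust_centers = np.asarray(fit_robust_kmeans(X_train, k, radius))   # hypothetical solver
        plain_centers = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_train).cluster_centers_

        def mean_sq_dist(centers, Y):
            d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            return d2.min(axis=1).mean()

        return {
            "robust": mean_sq_dist(robust_centers, X_far),
            "plain": mean_sq_dist(plain_centers, X_far),
        }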

Figures

Figures reproduced from arXiv: 2604.11118 by Babak Hassibi, Taylan Kargin, Vikrant Malik.

Figure 1: The trend of the worst-case error in Experiment 1, averaged over 30 trials for each … [figures/full_fig_p032_1.png]
Figure 2: A realization of the dataset in Experiment 2. The blue points are in Cluster A and the green points are in Cluster B. We see that the … [figures/full_fig_p032_2.png]
Figure 3: The trend of the worst-case error in Experiment 3, averaged over 30 trials for each … [figures/full_fig_p033_3.png]
Original abstract

K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd-Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper develops a distributionally robust variant of K-means by minimizing the worst-case expected squared distance over a Wasserstein-2 ball around the empirical distribution. A tractable dual reformulates the problem as soft clustering with smoothly weighted assignments to centers. An efficient block coordinate descent algorithm is proposed with claimed monotonic decrease and local linear convergence, supported by experiments showing gains in outlier robustness on benchmarks and synthetic data.

Significance. If the dual derivation and convergence analysis hold, the work supplies a principled, computationally tractable way to harden K-means against outliers and distribution shift, a frequent practical failure mode. The soft-assignment reformulation and BCD scheme with monotonicity guarantee are attractive for implementation; the experiments provide supporting evidence of practical benefit.

major comments (1)
  1. Convergence Analysis (BCD algorithm): The local linear convergence claim rests on smoothness/strong-convexity properties of the dualized objective that are not automatically preserved when soft weights approach zero for some clusters (possible for moderate-to-large Wasserstein radii or imbalanced data). The manuscript does not state an explicit non-degeneracy condition or show that the algorithm maintains the required Hessian positive-definiteness or Jacobian contraction in this regime.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The referee's observation on the convergence analysis is well-taken, and we will revise the manuscript to address it explicitly.

read point-by-point responses
  1. Referee: Convergence Analysis (BCD algorithm): The local linear convergence claim rests on smoothness/strong-convexity properties of the dualized objective that are not automatically preserved when soft weights approach zero for some clusters (possible for moderate-to-large Wasserstein radii or imbalanced data). The manuscript does not state an explicit non-degeneracy condition or show that the algorithm maintains the required Hessian positive-definiteness or Jacobian contraction in this regime.

    Authors: We appreciate this observation. The local linear convergence result (Theorem 4.2) relies on the dual objective satisfying uniform smoothness and strong convexity in a neighborhood of the iterates, which in turn requires the soft weights to be bounded away from zero. This holds under the radii used in our experiments and when clusters are reasonably separated, but the manuscript does not state an explicit non-degeneracy assumption. In the revision we will add a remark to the theorem statement clarifying this condition, together with a brief discussion of regimes where the weights remain positive (e.g., when the Wasserstein radius is not large relative to inter-cluster separation). We will also confirm that the weights stay bounded away from zero in the reported experiments. This clarification strengthens the presentation while preserving the main claims and algorithm.
    revision: yes
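One concrete form such a confirmation could take, as a sketch only: track the smallest per-cluster weight mass across iterations using the same illustrative softmax weights as in the earlier sketch (not the paper's dual quantities), and flag runs where it drifts toward zero, since that is the regime the referee's non-degeneracy concern points at.

    import numpy as np

    def min_cluster_mass(X, centers, beta):
        """Smallest total soft weight any cluster receives, with the illustrative
        softmax weights from the earlier sketch (an assumption, not the paper's
        dual-derived weights)."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        w = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
        w /= w.sum(axis=1, keepdims=True)
        return float(w.sum(axis=0).min())

    # Usage: call after every BCD iteration and warn if the value falls below a
    # chosen tolerance, signalling that the non-degeneracy condition behind the
    # local linear convergence claim may be violated for that radius.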

Circularity Check

0 steps flagged

No significant circularity; standard DRO dualization and new algorithm

full rationale

The derivation begins from an external modeling choice (Wasserstein-2 ball around the empirical measure) and proceeds via standard minimax duality to obtain a soft-clustering objective, followed by a newly proposed block coordinate descent procedure whose monotonicity and local linear convergence are asserted via direct analysis of the resulting non-smooth but convex program. No step equates a claimed prediction or uniqueness result to a fitted parameter or prior self-citation by construction; the radius remains a free hyperparameter, the dual is obtained from first-principles Lagrangian arguments, and the convergence claims rest on properties of the dualized loss rather than tautological renaming or self-referential fitting. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a tractable dual for the minimax problem over the Wasserstein ball and on standard properties of Wasserstein distance; the radius of the ball is a free hyperparameter.

free parameters (1)
  • Wasserstein ball radius
    Controls the size of the ambiguity set and must be chosen or tuned; directly affects the worst-case objective.
axioms (1)
  • standard math: the Wasserstein-2 distance defines a valid metric on probability measures
    Invoked to construct the ambiguity set around the empirical distribution.

pith-pipeline@v0.9.0 · 5428 in / 1062 out tokens · 24153 ms · 2026-05-10T16:34:44.612231+00:00 · methodology

