Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the (k, t degrees)- clustering problem, where, given a pointset P from a metric space, one must determine a subset S of k centers minimizing the sum of the t degrees-th powers of the distances of points in P from their closest centers. This formulation covers the well-studied k-median (t degrees t degrees = 1 ) and k-means (t degrees t degrees = 2 ) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the (k, t degrees)-clustering problem with z outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D . Remarkably, for D = O (1), , our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(y) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where y can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.
MapReduce algorithms for robust center-based clustering in doubling metrics
Pietracaprina, Andrea;Pucci, Geppino
2024
Abstract
Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the (k, t degrees)- clustering problem, where, given a pointset P from a metric space, one must determine a subset S of k centers minimizing the sum of the t degrees-th powers of the distances of points in P from their closest centers. This formulation covers the well-studied k-median (t degrees t degrees = 1 ) and k-means (t degrees t degrees = 2 ) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the (k, t degrees)-clustering problem with z outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D . Remarkably, for D = O (1), , our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(y) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where y can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.File | Dimensione | Formato | |
---|---|---|---|
1-s2.0-S0743731524001308-main.pdf
accesso aperto
Tipologia:
Published (Publisher's Version of Record)
Licenza:
Creative commons
Dimensione
835.71 kB
Formato
Adobe PDF
|
835.71 kB | Adobe PDF | Visualizza/Apri |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.