Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the (k, t degrees)- clustering problem, where, given a pointset P from a metric space, one must determine a subset S of k centers minimizing the sum of the t degrees-th powers of the distances of points in P from their closest centers. This formulation covers the well-studied k-median (t degrees t degrees = 1 ) and k-means (t degrees t degrees = 2 ) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the (k, t degrees)-clustering problem with z outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D . Remarkably, for D = O (1), , our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(y) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where y can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.

MapReduce algorithms for robust center-based clustering in doubling metrics

Pietracaprina, Andrea;Pucci, Geppino
2024

Abstract

Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the (k, t degrees)- clustering problem, where, given a pointset P from a metric space, one must determine a subset S of k centers minimizing the sum of the t degrees-th powers of the distances of points in P from their closest centers. This formulation covers the well-studied k-median (t degrees t degrees = 1 ) and k-means (t degrees t degrees = 2 ) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the (k, t degrees)-clustering problem with z outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D . Remarkably, for D = O (1), , our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(y) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where y can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.
File in questo prodotto:
File Dimensione Formato  
1-s2.0-S0743731524001308-main.pdf

accesso aperto

Tipologia: Published (Publisher's Version of Record)
Licenza: Creative commons
Dimensione 835.71 kB
Formato Adobe PDF
835.71 kB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3523765
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 1
  • OpenAlex ND
social impact