Cross-Clustering: A partial clustering algorithm with automatic estimation of the number of clusters

Paola Tellaroli; Alessandra R. Brazzale
2017

Abstract

Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters; iii) the lack of a method able to detect when partitioning a specific dataset is not appropriate; and iv) the dependence of the result on the initialization. We propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well-established hierarchical clustering algorithms: Ward's minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward's and Complete-linkage. We show, on both simulated and real datasets, that CC performs better than the other methods in terms of identifying the correct number of clusters, identifying outliers, and determining the real cluster memberships. We used CC to cluster samples, in order to identify disease subtypes, and to cluster gene profiles, in order to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be used successfully in such diverse applications. New results on a modified version of the CC algorithm show that it could also be suitable for identifying elongated clusters. The algorithm has been implemented in the statistical language R and is freely available from the CRAN repository.
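The abstract does not spell out how CC combines the two hierarchical methods, so the following is only a minimal sketch of the two building blocks it names, not the CC algorithm itself. It runs a naive agglomerative clustering with Ward's minimum-variance criterion and with Complete-linkage on a toy dataset, then cross-tabulates the two partitions. The function names (`agglomerate`, `ward_cost`, `complete_cost`, `contingency`), the toy points, and the chosen numbers of clusters are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def centroid(points, members):
    """Mean of the points whose indices are in `members`."""
    dim = len(points[0])
    return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]


def ward_cost(points, a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(points, a), centroid(points, b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def complete_cost(points, a, b):
    """Largest pairwise Euclidean distance between clusters a and b."""
    return max(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, k, cost):
    """Naive agglomerative clustering: merge the cheapest pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cost(points, clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


def contingency(labels_a, labels_b):
    """Count how many points fall in each (label_a, label_b) cell."""
    return Counter(zip(labels_a, labels_b))


# Toy data (hypothetical): two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]

ward_labels = agglomerate(points, k=2, cost=ward_cost)
complete_labels = agglomerate(points, k=3, cost=complete_cost)
table = contingency(ward_labels, complete_labels)

for (w, c), n in sorted(table.items()):
    print(f"Ward cluster {w}, Complete-linkage cluster {c}: {n} points")
```

The pairwise search makes this sketch cubic in the number of points, so it is suitable only for small examples; the R implementation mentioned in the abstract is the reference for actual use.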
Programme and Abstracts
978-9963-2227-4-2
Use this identifier to cite or link to this document: http://hdl.handle.net/11577/3262643