What to put in the bag? Comparing and contrasting procedures for text clustering
TUZZI, ARJUNA
2010
Abstract
This study addresses text clustering within the framework of bag-of-words approaches and considers it from several viewpoints. The most common text-clustering algorithms summarise textual features in simple quantitative measures and use these to assess the degree of similarity (or dissimilarity) between texts. Such procedures involve several choices concerning the vocabularies of the texts and the measures of similarity. By comparing and contrasting the results of eleven different procedures for clustering the texts of three different corpora, the study discusses the importance of those choices and examines the settings for which they may be suitable.
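As an illustration of the kind of choices the abstract refers to, the following minimal Python sketch builds bag-of-words vectors and compares two texts with two different similarity measures. It is not one of the eleven procedures evaluated in the study, which are not specified in this record; the function names, the tokenisation rule, and the optional stopword list are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text, stopwords=frozenset()):
    """Lowercase the text, split it into runs of letters, and count each word.

    The stopword set is one example of a vocabulary choice (hypothetical here).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


def cosine_similarity(bag_a, bag_b):
    """Cosine between two word-frequency vectors; 0.0 if either bag is empty."""
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(bag_a, bag_b):
    """Share of word types common to both texts, ignoring frequencies."""
    types_a, types_b = set(bag_a), set(bag_b)
    union = types_a | types_b
    return len(types_a & types_b) / len(union) if union else 0.0
```

The two measures can disagree: cosine similarity weighs word frequencies, whereas the Jaccard index looks only at which word types occur. Changing the stopword set or the measure changes the similarity matrix that a clustering algorithm would then receive.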