This study takes into account the issue of text clustering against the specific background of bag-of-words approaches and from different viewpoints. The most common algorithms for text clustering include instructions to summarise textual features in simple quantitative measures and use them to recognise the degree of similarity (or dissimilarity) between texts. These procedures involve several choices concerning the vocabularies of texts and measures of similarity. By comparing and contrasting the results obtained through eleven different procedures aimed at clustering the texts of three different corpora, this study discusses the importance of those choices and is focused on understanding for which environments they may be suitable.

What to put in the bag? Comparing and contrasting procedures for text clustering

TUZZI, ARJUNA
2010

Abstract

This study takes into account the issue of text clustering against the specific background of bag-of-words approaches and from different viewpoints. The most common algorithms for text clustering include instructions to summarise textual features in simple quantitative measures and use them to recognise the degree of similarity (or dissimilarity) between texts. These procedures involve several choices concerning the vocabularies of texts and measures of similarity. By comparing and contrasting the results obtained through eleven different procedures aimed at clustering the texts of three different corpora, this study discusses the importance of those choices and is focused on understanding for which environments they may be suitable.
2010
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/2498629
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact