The importance of being earnest (and average)

Arjuna Tuzzi
2018

Abstract

When a study aims to compare and contrast texts of the same genre and to obtain a good text clustering, we often resort to lexical-based approaches and appropriate measures of similarity or distance between texts (Burrows, 2002; Juola, 2008; Rudman, 1998; Stamatatos, 2009; Labbé and Labbé, 2001; Tuzzi, 2010), e.g. cosine similarity, Burrows's Delta, Labbé's intertextual distance. Given the properties and the formula of a distance, we obtain a square matrix of n×n cells containing n(n-1)/2 non-redundant positive values that can be exploited for an automatic classification of the n available texts. This distance matrix can also be read from an alternative perspective, i.e. as a ranking system: for each text we can sort the other n-1 texts from the closest to the furthest. The distribution of these ranks among texts is an interesting object of research (Alvo and Yu, 2014), both when we consider the whole corpus and when we observe groups of texts that share some property (e.g. they belong to the same author). A preliminary experiment on a large corpus of contemporary Italian novels showed that we can identify some novels that systematically occupy the top positions in all rankings and prove to be close to most of the available texts; conversely, other novels do not show strong similarities in any list and systematically lie in the furthest positions. This study compared the results achieved through different measures and formulated some hypotheses to understand when, in text clustering, it is worth either distinguishing "average" and "eccentric" novels or disregarding them in in-depth investigations.
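The reading of a distance matrix as a ranking system can be illustrated with a minimal sketch. It is not the procedure used in the study: the toy frequency vectors, the function names and the choice of cosine distance (one of the measures mentioned above) are assumptions made only for illustration, and the "mean received rank" is a hypothetical summary of how close a text sits, on average, in the lists of all the other texts.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vectors):
    """Build the symmetric n x n matrix; only n(n-1)/2 cells are computed."""
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = cosine_distance(vectors[i], vectors[j])
    return d

def rank_matrix(d):
    """ranks[i][j] is the position of text j in the list of text i
    (1 = closest, n-1 = furthest); the diagonal is left at 0.
    Ties are broken by index order because sorted() is stable."""
    n = len(d)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        for position, j in enumerate(others, start=1):
            ranks[i][j] = position
    return ranks

def mean_received_rank(ranks):
    """Average position of each text in the lists of the other n-1 texts."""
    n = len(ranks)
    return [sum(ranks[i][j] for i in range(n) if i != j) / (n - 1)
            for j in range(n)]

# Hypothetical toy word-frequency vectors (not data from the study).
toy = {
    "A": [10, 8, 6, 1],
    "B": [9, 9, 5, 2],
    "C": [8, 7, 7, 1],
    "D": [1, 2, 3, 20],
}
names = list(toy)
dist = distance_matrix([toy[name] for name in names])
ranks = rank_matrix(dist)
scores = dict(zip(names, mean_received_rank(ranks)))
```

In this toy example a low mean received rank would mark an "average" text and a high one an "eccentric" text; here text D, whose profile differs from the other three, comes last in every list.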
Qualico 2018 - Book of Abstracts
978-83-950966-0-0

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3271996