Analisi di dati testuali cronologici in corpora diacronici: effetti della normalizzazione sul curve clustering

Trevisani, Matilde; Tuzzi, Arjuna

In bag-of-words approaches textual data are organized in words×texts contingency tables. Diachronic corpora include texts which have a chronological order and produce words×time-points contingency tables, i.e. the frequencies of each word in the text (or in the set of texts) that refers to each time-point. The temporal evolution of word frequencies is crucial to highlight the distinctive features of time spans as well as to cluster words portraying a similar temporal pattern. However, to take into account the fluctuating size of available texts for each time-point, the strong asymmetry of word frequencies and the general problem of data sparsity, a transformation of data is necessary. This study aims at examining how different data transformations affect curve clustering in terms of number and composition of word groups. A functional data approach that envisages a smoothing procedure (B-splines) combined with a distance-based curve clustering has been adopted. Examples are taken from the corpus of titles of scientific papers published by the Journal of the American Statistical Association (and its predecessors) in the time-span 1888-2012 and consist in the analysis of the life-cycle of 900 keywords through the timeline of 107 volumes.