Clustering word life-cycles in chronological corpora: what data transformation for differing clustering goals

Trevisani, Matilde; Tuzzi, Arjuna

Chronological corpora are collections of texts ordered in time. Texts are often grouped into equal time intervals and, in bag-of-words approaches, the processed data are typically the frequencies of individual words in the set of texts assigned to the same time point. The temporal course of a word's occurrences is viewed as a proxy for the word's diffusion and vitality, i.e. its life-cycle. The basic objectives are to recognize temporal shapes and to cluster words with similar life-cycles. However, the strong asymmetry of the frequency spectrum typical of textual data (Large Number of Rare Events) has to be taken into account when defining the specific purpose of clustering and, hence, the type of any further processing of the data. The crucial decision is whether similarity depends essentially on the degree of synchronization or also on the level of word popularity (in a functional data analysis approach, whether to compare curves horizontally or also in terms of their amplitude variation). Several column normalizations coupled with row normalizations are applied to the word×time contingency table of the corpus data. The effect of selected data transformations on the resulting word groups is then examined by applying constrained spline smoothing and distance-based curve clustering.
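
The distinction between comparing life-cycles by shape alone and comparing them by shape and popularity can be illustrated with a minimal sketch. The toy counts, the function names, and the particular choices of normalization (dividing each column by its total, then each row by its total) are assumptions made purely for illustration; they are not necessarily the normalizations examined in the paper, and the spline smoothing step is omitted.

```python
import numpy as np

# Toy word x time table of raw counts (rows: words, columns: time points).
# All values are invented for illustration.
counts = np.array([
    [10.0, 20.0, 40.0, 20.0],   # frequent word, peak at the third time point
    [ 1.0,  2.0,  4.0,  2.0],   # rare word with the same temporal shape
    [40.0, 20.0, 10.0, 10.0],   # frequent word, declining over time
])

def column_normalize(table):
    """Divide each column by its total (relative frequency per time point)."""
    return table / table.sum(axis=0, keepdims=True)

def row_normalize(table):
    """Divide each row by its total, so every word's series sums to 1."""
    return table / table.sum(axis=1, keepdims=True)

def pairwise_euclidean(table):
    """Euclidean distances between all pairs of rows."""
    diff = table[:, None, :] - table[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Column normalization only: differences in popularity are retained.
col_only = column_normalize(counts)
# Column then row normalization: only the temporal shape is retained.
col_then_row = row_normalize(col_only)

d_col = pairwise_euclidean(col_only)
d_both = pairwise_euclidean(col_then_row)

print("distance word 0 vs word 1, column-normalized:", d_col[0, 1])
print("distance word 0 vs word 1, column+row-normalized:", d_both[0, 1])
```

Under these assumptions, the frequent and the rare word with the same temporal profile stay far apart after column normalization alone, but coincide once the row normalization removes the level of popularity. Which behaviour is desirable depends on the clustering goal.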