Clustering word life-cycles in chronological corpora: what data transformation for differing clustering goals

TUZZI, ARJUNA
2015

Abstract

Chronological corpora are collections of texts ordered in time. The texts are often grouped into equal time intervals and, in bag-of-words approaches, the data to be processed are typically the frequencies of individual words in the set of texts assigned to the same time point. The temporal course of a word's occurrences is viewed as a proxy for the word's diffusion and vitality, i.e. for its life-cycle. The basic objectives are to recognize temporal shapes and to cluster words with similar life-cycles. However, the strong asymmetry of the frequency spectrum typical of textual data (Large Number of Rare Events) has to be taken into account when defining the specific purpose of clustering and, hence, the type of any further processing of the data. The crucial decision is whether similarity depends essentially on the degree of synchronization or also on the level of word popularity (in a functional data analysis approach, whether to compare curves horizontally or also in terms of their amplitude variation). Several column normalizations coupled with row normalizations are applied to the word×time contingency table of corpus data. By applying constrained spline smoothing and distance-based curve clustering, the effect of selected data transformations on the generation of word groups is examined.
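
The choice between comparing curves by synchronization alone or also by amplitude can be illustrated with a minimal sketch. The abstract does not say which column and row normalizations were used, so the two below (division by column totals, then division by row sums) are assumptions chosen for illustration, as are the toy words and counts. Spline smoothing and the clustering step itself are left out.

```python
from math import sqrt

# Toy word x time table of raw counts (rows: words, columns: time points).
# Words and counts are invented for illustration only.
counts = {
    "alpha": [10, 20, 40, 20, 10],  # popular word, peak in the middle
    "beta": [1, 2, 4, 2, 1],        # rare word, same temporal shape as alpha
    "gamma": [4, 2, 1, 2, 4],       # rare word, opposite temporal shape
}


def column_normalize(table):
    """Divide each count by its column total (assumed column normalization)."""
    n_cols = len(next(iter(table.values())))
    totals = [sum(row[j] for row in table.values()) for j in range(n_cols)]
    return {w: [row[j] / totals[j] for j in range(n_cols)]
            for w, row in table.items()}


def row_normalize(table):
    """Divide each row by its own sum (assumed row normalization)."""
    return {w: [x / sum(row) for x in row] for w, row in table.items()}


def euclidean(u, v):
    """Plain Euclidean distance between two curves of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Amplitude-sensitive view: distances on raw counts.
raw_beta_alpha = euclidean(counts["beta"], counts["alpha"])
raw_beta_gamma = euclidean(counts["beta"], counts["gamma"])

# Shape-only view: column normalization followed by row normalization.
shape = row_normalize(column_normalize(counts))
shape_beta_alpha = euclidean(shape["beta"], shape["alpha"])
shape_beta_gamma = euclidean(shape["beta"], shape["gamma"])

print("raw:  ", raw_beta_alpha, raw_beta_gamma)
print("shape:", shape_beta_alpha, shape_beta_gamma)
```

On the raw counts the rare word "beta" lies closer to "gamma", which is equally rare, than to "alpha", which has the same temporal shape. After the two normalizations "beta" and "alpha" become indistinguishable up to rounding, so a distance-based clustering would group words by synchronization rather than by popularity.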
Conference program and book of abstracts
IFCS2015 Conference of the International Federation of Classification Societies

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3163506