Analysis of chronological textual data in diachronic corpora: effects of normalization on curve clustering

TUZZI, ARJUNA
2016

Abstract

In bag-of-words approaches, textual data are organized in words×texts contingency tables. Diachronic corpora consist of chronologically ordered texts and yield words×time-points contingency tables, which record the frequency of each word in the text (or set of texts) belonging to each time-point. The temporal evolution of word frequencies is crucial both for highlighting the distinctive features of time spans and for clustering words that share a similar temporal pattern. However, the data must be transformed to account for the fluctuating size of the texts available at each time-point, the strong asymmetry of word frequencies, and the general problem of data sparsity. This study examines how different data transformations affect curve clustering in terms of the number and composition of word groups. It adopts a functional data approach that combines a smoothing procedure (B-splines) with distance-based curve clustering. Examples are drawn from the corpus of titles of scientific papers published by the Journal of the American Statistical Association (and its predecessors) in 1888-2012, and consist of an analysis of the life cycle of 900 keywords across the timeline of 107 volumes.
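To make the role of the transformation concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes two illustrative transformations that the abstract does not name: relative frequencies (each count divided by the size of its time-point) and a further rescaling of each word's curve by its own peak. The counts, the sizes, and all function names are hypothetical, and the B-spline smoothing step is left out to keep the example short.

```python
from math import sqrt


def relative_frequencies(counts, sizes):
    """Divide each count by the number of tokens at its time-point."""
    return [[c / n for c, n in zip(row, sizes)] for row in counts]


def scale_by_row_max(table):
    """Rescale each word's curve so that its peak equals 1.

    Rows that are entirely zero are left as zeros.
    """
    scaled = []
    for row in table:
        peak = max(row)
        scaled.append([v / peak if peak > 0 else 0.0 for v in row])
    return scaled


def euclidean(a, b):
    """Euclidean distance between two curves sampled at the same time-points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy words x time-points table (hypothetical counts, not from the JASA corpus).
counts = [
    [2, 4, 8],     # word A: rising
    [20, 40, 80],  # word B: rising, ten times more frequent than A
    [8, 4, 2],     # word C: falling
]
sizes = [100, 100, 100]  # hypothetical number of tokens per time-point

rel = relative_frequencies(counts, sizes)
scaled = scale_by_row_max(rel)

# Distances from word A under each transformation.
d_rel_AB = euclidean(rel[0], rel[1])
d_rel_AC = euclidean(rel[0], rel[2])
d_scaled_AB = euclidean(scaled[0], scaled[1])
d_scaled_AC = euclidean(scaled[0], scaled[2])

print("relative frequencies: A-B", d_rel_AB, "A-C", d_rel_AC)
print("scaled by row peak:   A-B", d_scaled_AB, "A-C", d_scaled_AC)
```

With relative frequencies alone, word A sits closer to the falling word C, because both are rare. After rescaling by the row peak, A sits closer to B, which shares its rising shape. A distance-based clustering would therefore group the words differently depending on the transformation, which is the kind of effect the study sets out to examine.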
JADT 2016 - Proceedings of the 13th International Conference on Statistical Analysis of Textual Data
978-2-7466-9067-7
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3184450