Functional model-based curve clustering for discovering temporal patterns in chronological corpora

TUZZI, ARJUNA
2012

Abstract

In many applications of textual analysis, corpora have a temporal structure, i.e. they include texts that follow a chronological order (e.g. political discourses, institutional documents, newspaper articles, messages posted to a blog). The temporal trend of key topics or words is crucial for revealing distinctive features of the texts and of the corpus as a whole. Within bag-of-words approaches, the temporal course of a word is represented as a sequence of frequencies across time, i.e. it corresponds to a specific row of a word-type × time-point contingency table. Such data can be thought of as discrete observations of a curve, that is, as a functional observation. In chronological corpora, data are typically sparse over time, so many cells in the contingency table contain small counts or zeros. These zeros are due to the large number of word-types (vocabulary entries) with a relatively low number of associated word-tokens (an intrinsic feature of textual data commonly known as the large p, small n problem), as well as to the size of the time-point subcorpora. Depending on the number of documents and their size in word-tokens, the richness of information and the regularity of the corresponding signal can vary greatly across time. Time series of word frequencies therefore pose specific issues: high-dimensional data, individual (word) variability, and irregular, peak-like curves. The main objectives of this study are to identify the temporal patterns of words as functional curves and to cluster them into consistent groups of words with a similar pattern of evolution. We focus on methods for model-based curve clustering in the presence of the issues mentioned above. Curve clustering has long been studied using splines; however, splines are not appropriate for high-dimensional data and cannot model irregular functions such as spot and peak-like curves.
By contrast, a wavelet representation can accommodate a wider range of functional shapes and proves more flexible than splines. In our work we adapt a recent class of wavelet-based functional clustering mixed models to the setting of chronological corpora. For inference we consider both a frequentist framework (relying on the EM algorithm for maximum likelihood estimation provided by the recently developed R package curvclust) and a Bayesian version. A further issue of interest is disentangling lower-scale patterns from higher-level ones in order to assess the importance of a possible "regime" factor (e.g. the President's term of office in a corpus of end-of-year presidential addresses) relative to the temporal evolution of a chronological corpus. We show that examining the wavelet coefficient domain is useful for inspecting the process at different scales. A number of graphical tools are proposed to deal with such multiscale situations. The procedures are tested on different text genres: political and institutional discourses (written texts for oral delivery), press (written newspaper articles), and literary works (ancient and modern narrative texts).
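As a rough illustration of the data representation described in the abstract, the following Python sketch builds a word-type × time-point contingency table from a toy corpus and decomposes each word's row into wavelet coefficients at different scales. Everything in it is an assumption made for illustration: the four-year corpus is invented, the function names `contingency_table` and `haar_dwt` are hypothetical, raw counts are used instead of relative frequencies, and the Haar basis is chosen only because it is the simplest wavelet (the abstract does not say which basis the authors use). It shows only the representation step, not the mixed-model clustering and not the curvclust package.

```python
from collections import Counter
from math import sqrt

# Toy chronological corpus (invented): one short text per time point.
corpus = {
    2001: "war peace war economy",
    2002: "economy economy peace",
    2003: "war economy jobs jobs",
    2004: "jobs jobs jobs peace",
}


def contingency_table(corpus):
    """Return (time_points, table), where table[word] is the row of raw
    counts of that word-type across the chronologically ordered time points."""
    time_points = sorted(corpus)
    counts = [Counter(corpus[t].split()) for t in time_points]
    vocabulary = sorted(set().union(*counts))
    # Counter returns 0 for absent words, which yields the sparse zero cells.
    table = {w: [c[w] for c in counts] for w in vocabulary}
    return time_points, table


def haar_dwt(signal):
    """Orthonormal Haar transform of a sequence whose length is a power of two.
    Returns (approximation, details), with details ordered coarse to fine."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Differences give the detail coefficients at the current scale.
        details.insert(0, [(a - b) / sqrt(2) for a, b in pairs])
        # Scaled sums give the smoother signal passed on to the next scale.
        approx = [(a + b) / sqrt(2) for a, b in pairs]
    return approx[0], details


time_points, table = contingency_table(corpus)
coefficients = {word: haar_dwt(row) for word, row in table.items()}

for word, (approx, details) in coefficients.items():
    print(word, table[word], round(approx, 3), details)
```

In this sketch the coarsest detail coefficient of a word contrasts the first half of the period with the second half, which is the kind of higher-level shift a "regime" factor would produce, while the finest coefficients capture jumps between adjacent time points. A clustering model would then operate on such coefficient vectors rather than on the raw counts.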
2012
Workshop on Model-Based Clustering and Classification

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/2531416