Shaping the History of Words

TUZZI, ARJUNA
2012

Abstract

In textual analysis, many corpora consist of texts with a chronological order, and this temporal dimension is crucial for understanding their structure. Common examples include addresses delivered by institutional representatives in different years, articles retrieved from newspaper archives, literary works written by an author over the course of an active life, and essays written by a student at different stages of his or her education. In many applications, the temporal evolution of key topics, words and concepts is important for highlighting the distinctive features of a chronological corpus. In a typical bag-of-words approach, data are organized in word-type × time-point contingency tables that record the number of occurrences of each word-type at each time point. It is therefore crucial to recognize, for each word-type, its specific sequential pattern, and to determine prototype patterns suitable for clustering word-types that show a similar evolution. The very large number of word-types (vocabulary entries), each often characterized by a relatively low number of observed occurrences over time (commonly known as the large p, small n problem) that are, moreover, sparsely spaced over time, poses many challenges from both statistical and computational perspectives. In this context, sparsity takes the form of a large number of zeros over most of the time interval. These zero cells are due both to the large number of word-types with few corresponding word-tokens (an intrinsic feature of textual data) and to the varying size of the time-point subcorpora (the richness of information is highly variable across time, in terms of both the number of documents and the number of word-tokens they contain). Such discrete data can be thought of as continuous objects represented by functional relationships. In functional data analysis, the focus of interest is a set of curves, shapes, objects or, more generally, functional observations.

The aim of this study is twofold: (1) to identify a specific sequential pattern for each word as a functional object; (2) to partition these curves (word patterns) into clusters, which involves defining a suitable measure of similarity between curves. We propose a flexible model-based procedure, with specific reference to a corpus of end-of-year addresses delivered by the ten Presidents of the Italian Republic in the period 1949-2010.
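The word-type × time-point contingency table described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the function names `contingency_table` and `sparsity`, the naive whitespace tokenization, and the three-document toy corpus are all assumptions made purely for illustration. The sketch only shows how the raw counts are laid out and how the share of zero cells might be measured.

```python
from collections import Counter


def contingency_table(docs):
    """Build a word-type x time-point table of raw counts.

    docs: iterable of (time_point, text) pairs (hypothetical input format).
    Returns (time_points, table), where table[word] is a list of counts
    aligned with the sorted time_points.
    """
    counts = {}
    for time_point, text in docs:
        # Naive lowercase/whitespace tokenization: an illustrative assumption.
        tokens = text.lower().split()
        counts.setdefault(time_point, Counter()).update(tokens)
    time_points = sorted(counts)
    vocabulary = sorted({w for c in counts.values() for w in c})
    # A Counter returns 0 for words absent from a time point.
    table = {w: [counts[t][w] for t in time_points] for w in vocabulary}
    return time_points, table


def sparsity(table):
    """Share of zero cells in the contingency table."""
    cells = [n for row in table.values() for n in row]
    return sum(1 for n in cells if n == 0) / len(cells)


# Invented toy corpus, not taken from the presidential addresses.
toy_corpus = [
    (1949, "peace and work"),
    (1950, "peace peace and hope"),
    (1951, "work and europe"),
]
time_points, table = contingency_table(toy_corpus)
for word, row in table.items():
    print(word, row)
print("share of zero cells:", sparsity(table))
```

Each row of `table` is the discrete sequence that the abstract proposes to treat as a functional object. In practice one would probably normalize the counts by the size of each time-point subcorpus before comparing curves, since subcorpus size varies across time.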
2012
Book of Abstracts
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3034114