Capturing Distinctiveness: Transparent Procedures to Escape a Pervasive Black-Box Propensity

Arjuna Tuzzi
2023

Abstract

In many quantitative linguistics applications, scholars are interested in identifying a set of linguistic features that proves distinctive for a text, or for a class of texts, with reference to a corpus or a model. Typical features are lexical elements (words, multi-word units, n-grams, lemmas), part-of-speech categories, or other phonetic and morphosyntactic phenomena. Moreover, in many applications a set of distinctive features is selected a priori, either to support a qualitative reading of the texts or to be exploited in text clustering, topic modelling, or content-mapping tasks. Supervised Machine Learning (ML) algorithms are commonly used to classify texts (a test set) on the basis of training data (a training set). Drawing on large amounts of available, mixed, undifferentiated, multilevel, multilayer, and multipurpose features, ML generally provides an effective way to discriminate among existing classes and then to ascribe each new text to one of them. Although the accuracy of classification is often highly satisfactory, the distinctive features of each class are seldom explainable and transparent. The need to move from black-box procedures to explainable methods underlies the distinction between ML and Statistical Learning (SL) approaches. Both SL and ML exploit data to make predictions, but SL aims at a deeper understanding of data structures and of the relations among variables. From this perspective, SL methods capable of identifying the distinctive features of each class should interact with the solutions offered by ML algorithms in order to achieve a description in terms of linguistic similarities and differences.

The umbrella term keyness is often used in text analysis to refer to the various measures that reveal to what extent a word can be considered distinctive of a text or of a text class. Many measures have been developed to meet the requirements of different perspectives: term frequency-inverse document frequency (TF-IDF), log-likelihood ratios, odds ratios, and p-values based on the hypergeometric model (to mention just a few), as well as methods for keyword extraction, distance-based measures, and solutions provided by Bayesian approaches and generative (topic) models.

Starting from an established corpus of institutional speeches (the End-of-Year Addresses of the Italian Presidents of the Republic, 1949–2022) arranged into President classes, this study explores the concept of keyness to highlight the strengths and weaknesses of different approaches, their consistency (overlap), and how they can be applied in practice, particularly when working with large corpora. As most procedures rest on the occurrences recorded in a term-document matrix (TDM), in which terms represent features and documents represent texts or classes, the measures must tackle data normalization and dispersion problems (e.g. a linguistic feature should not be considered distinctive of a text as a whole when it occurs only within a specific portion of it, nor of an entire class when it occurs only in one or a few of its texts). This work also shows to what extent procedures that exploit samples of equal-sized text chunks and tailor-made normalizations of raw frequencies (with related diagnostic measures) play a fundamental role in improving results.
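
For concreteness, the sketch below (in Python) illustrates two of the keyness measures named in the abstract, a signed log-likelihood (G2) score and TF-IDF, computed on a toy term-document matrix. The terms, classes, and counts are invented for illustration; this is not the study's data or its actual implementation.

import numpy as np

# Toy TDM: rows are terms (features), columns are classes.
# All names and counts are invented for illustration.
terms = ["pace", "lavoro", "giovani", "riforma"]
classes = ["pres_A", "pres_B"]
tdm = np.array([[120.0, 30.0],
                [45.0, 80.0],
                [0.0, 60.0],
                [70.0, 0.0]])

def g2_keyness(tdm):
    # Signed Dunning log-likelihood (G2): each term in each class is
    # compared against the rest of the corpus via a 2x2 contingency table.
    N = tdm.sum()
    a = tdm                                   # term in class
    b = tdm.sum(axis=1, keepdims=True) - a    # term in the other classes
    c = tdm.sum(axis=0, keepdims=True) - a    # other terms in class
    d = N - a - b - c                         # other terms elsewhere
    def ll(o, e):
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(o > 0, o * np.log(o / e), 0.0)
    e_a = (a + b) * (a + c) / N               # expected count under independence
    g2 = 2 * (ll(a, e_a) + ll(b, (a + b) * (b + d) / N)
              + ll(c, (c + d) * (a + c) / N) + ll(d, (c + d) * (b + d) / N))
    return np.sign(a - e_a) * g2              # positive = overused in the class

def tf_idf(tdm):
    # Classic TF-IDF; note that IDF vanishes for terms present in every
    # class, a well-known weakness when the number of documents is small.
    tf = tdm / tdm.sum(axis=0, keepdims=True)
    df = (tdm > 0).sum(axis=1)
    return tf * np.log(tdm.shape[1] / df)[:, None]

for name, scores in (("signed G2", g2_keyness(tdm)), ("TF-IDF", tf_idf(tdm))):
    print(f"{name}  (columns: {', '.join(classes)})")
    for term, row in zip(terms, scores):
        print(f"  {term:10s}" + "  ".join(f"{v:10.3f}" for v in row))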
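
The dispersion problem mentioned in the abstract can also be made concrete. The sketch below, again with an invented toy text rather than the study's corpus, splits a token sequence into equal-sized chunks and computes Gries's DP (deviation of proportions) as one possible diagnostic: a feature spread across the whole text scores near 0, while a feature confined to one portion scores near 1 and should not be treated as distinctive of the text as a whole.

def equal_chunks(tokens, n_chunks):
    # Contiguous, equal-sized chunks (trailing remainder tokens are dropped).
    size = len(tokens) // n_chunks
    return [tokens[i * size:(i + 1) * size] for i in range(n_chunks)]

def dp(tokens, feature, n_chunks=10):
    # Gries's DP with equal-sized parts: half the sum of absolute deviations
    # between each chunk's observed share of the feature and the expected
    # uniform share. 0 = perfectly even spread; values near 1 = concentration.
    chunks = equal_chunks(tokens, n_chunks)
    total = sum(chunk.count(feature) for chunk in chunks)
    if total == 0:
        return None
    expected = 1.0 / n_chunks
    return 0.5 * sum(abs(chunk.count(feature) / total - expected)
                     for chunk in chunks)

# 'pace' occurs throughout the toy text, 'terremoto' only in one portion:
# the diagnostic flags only the former as evenly dispersed.
text = ["pace", "x", "y"] * 100 + ["terremoto"] * 30 + ["z"] * 70
for w in ("pace", "terremoto"):
    print(w, round(dp(text, w), 3))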
QUALICO 2023 Book of Abstracts

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3488521