Does the Century Matter? Machine Learning Methods to Attribute Historical Periods in an Italian Literary Corpus

Michele A. Cortelazzo; Franco Gatti; Arjuna Tuzzi
2023

Abstract

This study analyses an Italian literary corpus from a diachronic perspective using machine learning methods. Drawing on texts written between the 16th and the 21st century, it applies a well-known, robust machine learning (ML) algorithm, Random Forest (RF), to see how the texts are classified under four different partitions, representing periodizations theorized by four Italian literature scholars. The corpus used to train the ML algorithm includes 420 Italian texts: 100 from the 16th century, 27 from the 17th, 57 from the 18th, 100 from the 19th, 100 from the 20th, and 36 from the 21st. To vectorize the texts, we used the Author’s Multilevel N-gram Profile (AMNP) (Mikros and Perifanos, 2013; Cortelazzo, Mikros, and Tuzzi, 2018), a document representation method that takes into account a diverse set of linguistic features: n-grams of increasing length (unigrams, bigrams, trigrams) at different levels (character, word). Each text was split into chunks of 2,000 words, and each chunk was then transformed into an AMNP vector. Random Forest classified the chunks with high precision: across the four periodizations, precision ranged from a minimum of 89% for the partition based on Migliorini's theories to a maximum of 97% for the partition based on Cella's. Looking at the misclassified cases, particularly under Migliorini's partition, it is notable that when Random Forest assigns a text chunk to the wrong century, the error is usually of ±1 century.
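The chunking and multilevel n-gram steps described above can be illustrated with a minimal Python sketch. It is a simplified illustration, not the authors' implementation. The function names `split_into_chunks`, `ngrams`, and `multilevel_profile` are hypothetical, and two choices are assumptions: whitespace tokenization, and dropping a trailing remainder shorter than the chunk size. Feature selection, frequency normalization, and the Random Forest classifier itself are omitted.

```python
from collections import Counter


def split_into_chunks(text, chunk_size=2000):
    """Split a text into consecutive chunks of `chunk_size` words.

    Assumption: words are whitespace-separated, and a trailing
    remainder shorter than `chunk_size` is discarded.
    """
    words = text.split()
    last_start = len(words) - chunk_size + 1
    return [words[i:i + chunk_size] for i in range(0, last_start, chunk_size)]


def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence (list of words or string)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def multilevel_profile(words, max_n=3):
    """Count word-level and character-level n-grams for n = 1..max_n.

    Keys are (level, n, ngram) triples, so the two levels and the
    three lengths stay distinguishable within a single profile.
    """
    profile = Counter()
    chars = " ".join(words)  # character level includes spaces between words
    for n in range(1, max_n + 1):
        for gram in ngrams(words, n):
            profile[("word", n, " ".join(gram))] += 1
        for gram in ngrams(chars, n):
            profile[("char", n, "".join(gram))] += 1
    return profile


if __name__ == "__main__":
    sample = "nel mezzo del cammin di nostra vita"
    for chunk in split_into_chunks(sample, chunk_size=3):
        print(chunk, len(multilevel_profile(chunk)))
```

In a full pipeline, one profile per chunk would be turned into a fixed-length numeric vector (for example, by keeping only the n-grams most frequent across the corpus) before being passed to a classifier.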
Quantitative Approaches to Universality and Individuality in Language (2023)
ISBN: 9783110628081; 9783110763560; 9783110763638
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3460641