Massive amounts of data are generated continuously in every fields, such as finance, social networks, medicine, and many others. Data mining is the task of discovering interesting patterns exploring large amounts of data, making those useful and understandable. In recent years, the data mining community has tackled and proposed a vast set of problems but many others are still open. Now more than ever, it is of crucial importance to be able to analyze and extract reliable knowledge from massive datasets. However, this fundamental task poses some challenges. The first one is to design algorithms that scale the computation to the analysis of massive datasets. In such a scenario, very often approaches that return rigorous and high quality approximations are the only viable approach. The second one is to develop strategies that extract useful knowledge providing statistical guarantees on the analysis, while filtering out spurious discoveries. Finally, the abundance of data is opening new scenarios, with a lot of sources generating data continuously, requiring analyses that take into account the sequential nature of the data. The objective of this Thesis is to design novel scalable and rigorous techniques to mine patterns from sequential data, in three scenarios. The first scenario we consider is mining frequent sequential patterns through sampling. Sequential pattern mining is a fundamental task in data mining and knowledge discovery, with applications in several areas (e.g., biology), that has been extensively studied in the literature, with the definition of several exact methods. However, for large modern sized datasets, the execution of exact methods is computationally very demanding. In such a direction, we develop an algorithm to mine rigorous approximations, defined in terms of false positives or false negatives, of the frequent sequential patterns using sampling. The second scenario we consider is mining patterns from samples from unknown probability distributions. In many real life applications (e.g., market basket analysis), the analysis of a dataset is performed to gain insight on the underlying generative process of the data. However, by analyzing only a sample, one cannot exactly solve the problem and has to resort to approximations. In this setting, we tackle two problems: the problem of mining, from a single dataset, true frequent sequential patterns, which are sequential patterns frequently generated by the underlying process generating the data; and the problem of mining statistically robust patterns from a sequence of datasets, which are patterns whose probabilities of being generated from the underlying generative processes behind the sequence of datasets follow well specified trends. For both problems, we develop novel algorithms that return rigorous approximations, defined in terms of false positives or false negatives. The last scenario we consider is mining significant patterns. In significant pattern mining, the dataset is seen as a sample from an unknown probability distribution, and the aim is to extract patterns significantly deviating from an assumed null hypothesis with rigorous guarantees in terms of statistical significance of the output, controlling the retrieval of false discoveries. In this setting, we tackle two problems: the problem of mining statistically significant sequential patterns and the problem of mining statistically significant paths in time series data from an unknown network. For both problems, we develop novel algorithms that provide rigorous guarantees in term of false discoveries, employing the statistical hypothesis testing framework and techniques based on permutation testing.

Massive amounts of data are generated continuously in every fields, such as finance, social networks, medicine, and many others. Data mining is the task of discovering interesting patterns exploring large amounts of data, making those useful and understandable. In recent years, the data mining community has tackled and proposed a vast set of problems but many others are still open. Now more than ever, it is of crucial importance to be able to analyze and extract reliable knowledge from massive datasets. However, this fundamental task poses some challenges. The first one is to design algorithms that scale the computation to the analysis of massive datasets. In such a scenario, very often approaches that return rigorous and high quality approximations are the only viable approach. The second one is to develop strategies that extract useful knowledge providing statistical guarantees on the analysis, while filtering out spurious discoveries. Finally, the abundance of data is opening new scenarios, with a lot of sources generating data continuously, requiring analyses that take into account the sequential nature of the data. The objective of this Thesis is to design novel scalable and rigorous techniques to mine patterns from sequential data, in three scenarios. The first scenario we consider is mining frequent sequential patterns through sampling. Sequential pattern mining is a fundamental task in data mining and knowledge discovery, with applications in several areas (e.g., biology), that has been extensively studied in the literature, with the definition of several exact methods. However, for large modern sized datasets, the execution of exact methods is computationally very demanding. In such a direction, we develop an algorithm to mine rigorous approximations, defined in terms of false positives or false negatives, of the frequent sequential patterns using sampling. The second scenario we consider is mining patterns from samples from unknown probability distributions. In many real life applications (e.g., market basket analysis), the analysis of a dataset is performed to gain insight on the underlying generative process of the data. However, by analyzing only a sample, one cannot exactly solve the problem and has to resort to approximations. In this setting, we tackle two problems: the problem of mining, from a single dataset, true frequent sequential patterns, which are sequential patterns frequently generated by the underlying process generating the data; and the problem of mining statistically robust patterns from a sequence of datasets, which are patterns whose probabilities of being generated from the underlying generative processes behind the sequence of datasets follow well specified trends. For both problems, we develop novel algorithms that return rigorous approximations, defined in terms of false positives or false negatives. The last scenario we consider is mining significant patterns. In significant pattern mining, the dataset is seen as a sample from an unknown probability distribution, and the aim is to extract patterns significantly deviating from an assumed null hypothesis with rigorous guarantees in terms of statistical significance of the output, controlling the retrieval of false discoveries. In this setting, we tackle two problems: the problem of mining statistically significant sequential patterns and the problem of mining statistically significant paths in time series data from an unknown network. For both problems, we develop novel algorithms that provide rigorous guarantees in term of false discoveries, employing the statistical hypothesis testing framework and techniques based on permutation testing.

Tecniche Rigorose e Scalabili per il Mining di Pattern da Dati Sequenziali / Tonon, Andrea. - (2022 Mar 11).

Tecniche Rigorose e Scalabili per il Mining di Pattern da Dati Sequenziali

TONON, ANDREA
2022

Abstract

Massive amounts of data are generated continuously in every fields, such as finance, social networks, medicine, and many others. Data mining is the task of discovering interesting patterns exploring large amounts of data, making those useful and understandable. In recent years, the data mining community has tackled and proposed a vast set of problems but many others are still open. Now more than ever, it is of crucial importance to be able to analyze and extract reliable knowledge from massive datasets. However, this fundamental task poses some challenges. The first one is to design algorithms that scale the computation to the analysis of massive datasets. In such a scenario, very often approaches that return rigorous and high quality approximations are the only viable approach. The second one is to develop strategies that extract useful knowledge providing statistical guarantees on the analysis, while filtering out spurious discoveries. Finally, the abundance of data is opening new scenarios, with a lot of sources generating data continuously, requiring analyses that take into account the sequential nature of the data. The objective of this Thesis is to design novel scalable and rigorous techniques to mine patterns from sequential data, in three scenarios. The first scenario we consider is mining frequent sequential patterns through sampling. Sequential pattern mining is a fundamental task in data mining and knowledge discovery, with applications in several areas (e.g., biology), that has been extensively studied in the literature, with the definition of several exact methods. However, for large modern sized datasets, the execution of exact methods is computationally very demanding. In such a direction, we develop an algorithm to mine rigorous approximations, defined in terms of false positives or false negatives, of the frequent sequential patterns using sampling. The second scenario we consider is mining patterns from samples from unknown probability distributions. In many real life applications (e.g., market basket analysis), the analysis of a dataset is performed to gain insight on the underlying generative process of the data. However, by analyzing only a sample, one cannot exactly solve the problem and has to resort to approximations. In this setting, we tackle two problems: the problem of mining, from a single dataset, true frequent sequential patterns, which are sequential patterns frequently generated by the underlying process generating the data; and the problem of mining statistically robust patterns from a sequence of datasets, which are patterns whose probabilities of being generated from the underlying generative processes behind the sequence of datasets follow well specified trends. For both problems, we develop novel algorithms that return rigorous approximations, defined in terms of false positives or false negatives. The last scenario we consider is mining significant patterns. In significant pattern mining, the dataset is seen as a sample from an unknown probability distribution, and the aim is to extract patterns significantly deviating from an assumed null hypothesis with rigorous guarantees in terms of statistical significance of the output, controlling the retrieval of false discoveries. In this setting, we tackle two problems: the problem of mining statistically significant sequential patterns and the problem of mining statistically significant paths in time series data from an unknown network. For both problems, we develop novel algorithms that provide rigorous guarantees in term of false discoveries, employing the statistical hypothesis testing framework and techniques based on permutation testing.
Rigorous and Scalable Techniques for Mining Patterns from Sequential Data
11-mar-2022
Massive amounts of data are generated continuously in every fields, such as finance, social networks, medicine, and many others. Data mining is the task of discovering interesting patterns exploring large amounts of data, making those useful and understandable. In recent years, the data mining community has tackled and proposed a vast set of problems but many others are still open. Now more than ever, it is of crucial importance to be able to analyze and extract reliable knowledge from massive datasets. However, this fundamental task poses some challenges. The first one is to design algorithms that scale the computation to the analysis of massive datasets. In such a scenario, very often approaches that return rigorous and high quality approximations are the only viable approach. The second one is to develop strategies that extract useful knowledge providing statistical guarantees on the analysis, while filtering out spurious discoveries. Finally, the abundance of data is opening new scenarios, with a lot of sources generating data continuously, requiring analyses that take into account the sequential nature of the data. The objective of this Thesis is to design novel scalable and rigorous techniques to mine patterns from sequential data, in three scenarios. The first scenario we consider is mining frequent sequential patterns through sampling. Sequential pattern mining is a fundamental task in data mining and knowledge discovery, with applications in several areas (e.g., biology), that has been extensively studied in the literature, with the definition of several exact methods. However, for large modern sized datasets, the execution of exact methods is computationally very demanding. In such a direction, we develop an algorithm to mine rigorous approximations, defined in terms of false positives or false negatives, of the frequent sequential patterns using sampling. The second scenario we consider is mining patterns from samples from unknown probability distributions. In many real life applications (e.g., market basket analysis), the analysis of a dataset is performed to gain insight on the underlying generative process of the data. However, by analyzing only a sample, one cannot exactly solve the problem and has to resort to approximations. In this setting, we tackle two problems: the problem of mining, from a single dataset, true frequent sequential patterns, which are sequential patterns frequently generated by the underlying process generating the data; and the problem of mining statistically robust patterns from a sequence of datasets, which are patterns whose probabilities of being generated from the underlying generative processes behind the sequence of datasets follow well specified trends. For both problems, we develop novel algorithms that return rigorous approximations, defined in terms of false positives or false negatives. The last scenario we consider is mining significant patterns. In significant pattern mining, the dataset is seen as a sample from an unknown probability distribution, and the aim is to extract patterns significantly deviating from an assumed null hypothesis with rigorous guarantees in terms of statistical significance of the output, controlling the retrieval of false discoveries. In this setting, we tackle two problems: the problem of mining statistically significant sequential patterns and the problem of mining statistically significant paths in time series data from an unknown network. For both problems, we develop novel algorithms that provide rigorous guarantees in term of false discoveries, employing the statistical hypothesis testing framework and techniques based on permutation testing.
Tecniche Rigorose e Scalabili per il Mining di Pattern da Dati Sequenziali / Tonon, Andrea. - (2022 Mar 11).
File in questo prodotto:
File Dimensione Formato  
tesi_definitiva_Andrea_Tonon.pdf

accesso aperto

Descrizione: tesi_definitiva_Andrea_Tonon
Tipologia: Tesi di dottorato
Dimensione 2.92 MB
Formato Adobe PDF
2.92 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3445089
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact