Rigorous and Efficient Algorithms for Mining Sequential Data, and Applications to Bioinformatics

Santoro, Diego

Data mining is the task of extracting meaningful information, i.e., patterns and new knowledge, from massive data. Data mining finds application in several domains, ranging from, e.g., e-commerce, social networks, and Internet of Things, to medicine and biology. In this Thesis we focus on the analysis of a specific type of data, i.e., sequential data, that are data composed of elements with an underlying intrinsic sequential nature among them. In particular, we study large datasets of sequential transactions, i.e., sequences of sets of items, where the items can represent, e.g., objects purchased by customers, and large datasets of k-mers, which are substrings of length k of a biological sequence. In data mining, there are two ways in which a dataset can be considered. In the first scenario, the dataset is considered as a sample drawn from the unknown generative process underlying the data, and it is studied to gain insights about the generative distribution. In the second scenario, the most common one in data mining, the dataset is studied in order to extract meaningful patterns that reside in it. In this Thesis we consider both knowledge discovery approaches. For the first scenario, we study datasets of sequential patterns in order to gain insights about their unknown generative process. The problem we consider is to mine the true frequent sequential patterns, which are those sequential patterns that are frequently generated by the generative process of the data. Since exact approaches are infeasible, given that the generative distribution is unknown, we propose algorithms to mine rigorous approximations of the true frequent sequential patterns (and their frequencies). Our algorithms are based on theoretical results that guarantee that their outputs are high-quality approximations, leveraging on the study of the Rademacher complexity, an advanced tool from statistical learning theory, of sequential patterns. We show the effectiveness of our algorithms with our extensive experimental evaluation on several real world datasets. For the second scenario, we study datasets representing biological sequences in order to mine meaningful k-mers residing in them. The problem we consider is to mine frequent k-mers, which are those k-mers that appear with a relatively high frequency in the dataset. Exact approaches to extract frequent k-mers exist, but they require high computational resources to analyze large datasets. Thus, we propose SPRISS, a sampling-based algorithm to rigorously approximate frequent k-mers, proving rigorous guarantees on the quality of its output. Our sampling scheme is based on the creation of a sample of reads, which is analyzed in order to gain insights about the original dataset. For our theoretical analysis, which provides the sample size required to obtain accurate estimates from SPRISS, we study the pseudodimension, an advanced concept from statistical learning theory, of k-mers in reads. We show the effectiveness of SPRISS with our extensive experimental evaluation on several real world datasets. Finally, we show that SPRISS can be used in several bioinformatics applications, like the comparison of metagenomic datasets, the discovery of discriminative k-mers, and the SNP genotyping, in order to speed-up the down-stream analyses, while achieving high-quality estimations of the exact results.

Rigorous and Efficient Algorithms for Mining Sequential Data, and Applications to Bioinformatics / Santoro, Diego. - (2023 Mar 20).