Discriminant functional gene groups identification with machine learning and prior knowledge

Zycinski, G; Squillario, M; Barla, A; Sanavia, Tiziana; Verri, A; Di Camillo, Barbara

High-throughput technologies can rapidly produce huge amounts of gene expression data, useful for characterizing a wide variety of phenotypes. However, choosing the best methods and approaches to analyze data from a high-throughput study is not trivial, and it makes the biological interpretation of the results a challenging task. The current analysis pipeline first selects a set of genes that are significant in some sense for a specific aspect of the available data and then provides a functional characterization of the results using the many public sources of prior biological knowledge. This approach, while successful, may not obtain the best possible results because, for example, it could discard some functions or processes known to be relevant for the specific biological question under analysis merely because few of their genes appear in the identified list. We present a new analysis framework, Knowledge Driven Variable Selection (KDVS), that integrates prior knowledge into data analysis. The expression data matrix is partitioned according to prior knowledge into smaller matrices that are easier to analyze and to interpret from both computational and biological viewpoints. Therefore KDVS, unlike the current analysis pipeline, does not exclude a priori any function or process potentially relevant for the biological question under investigation. Three case studies are presented to demonstrate the performance of the method.
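The partitioning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `partition_by_annotation`, the genes-by-samples layout, and the toy term and gene identifiers are all assumptions made for illustration. The sketch only shows how a single expression matrix might be split into one smaller matrix per functional term, so that each term can be analyzed on its own rather than being filtered out after gene selection.

```python
def partition_by_annotation(expr, gene_ids, annotation):
    """Split a genes-x-samples matrix into one submatrix per functional term.

    expr       : list of rows, one row of expression values per gene
    gene_ids   : gene identifier for each row of expr (same order)
    annotation : dict mapping a term name to the genes annotated with it
                 (hypothetical prior-knowledge source)
    """
    row_of = {gene: i for i, gene in enumerate(gene_ids)}
    submatrices = {}
    for term, genes in annotation.items():
        # Keep only the annotated genes that were actually measured.
        members = [g for g in genes if g in row_of]
        if not members:
            continue  # no measured gene for this term, nothing to analyze
        submatrices[term] = (members, [list(expr[row_of[g]]) for g in members])
    return submatrices


# Toy data: 4 genes measured on 3 samples (values are arbitrary).
gene_ids = ["g1", "g2", "g3", "g4"]
expr = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0],
]
annotation = {
    "term_A": ["g1", "g3"],
    "term_B": ["g2", "g3", "g4"],
    "term_C": ["g9"],  # annotated gene absent from the data
}

parts = partition_by_annotation(expr, gene_ids, annotation)
for term, (members, sub) in parts.items():
    print(term, members, len(sub), "genes x", len(sub[0]), "samples")
```

A gene may belong to several terms (here `g3`), so the submatrices can overlap. Each submatrix would then be passed to whatever classification or variable selection method is chosen, which this sketch leaves out.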