Kernel methods for large-scale graph-based heterogeneous biological data integration

Tran, Van Dinh

The last decade has experienced a rapid growth in volume and diversity of biological data, thanks to the development of high-throughput technologies related to web services and embeded systems. It is common that information related to a given biological phenomenon is encoded in multiple data sources. On the one hand, this provides a great opportunity for biologists and data scientists to have more unified views about phenomenon of interest. On the other hand, this presents challenges for scientists to find optimal ways in order to wisely extract knowledge from such huge amount of data which normally cannot be done without the help of automated learning systems. Therefore, there is a high need of developing smart learning systems, whose input as set of multiple sources, to support experts to form and assess hypotheses in biology and medicine. In these systems, the problem of combining multiple data sources or data integration needs to be efficiently solved to achieve high performances. Biological data can naturally be represented as graphs. By taking graphs for data representation, we can take advantages from the access to a solid and principled mathematical framework for graphs, and the problem of data integration becomes graph-based integration. In recent years, the machine learning community has witnessed the tremendous growth in the development of kernel-based learning algorithms. Kernel methods whose kernel functions allow to separate between the representation of the data and the general learning algorithm. Interestingly, kernel representation can be applied to any type of data, including trees, graphs, vectors, etc. For this reason, kernel methods are a reasonable and logical choice for graph-based inference systems. However, there is a number of challenges for graph-based systems using kernel methods need to be effectively solved, including definition of node similarity measure, graph sparsity, scalability, efficiency, complementary property exploitation, integration methods. The contributions of the thesis aim at investigating to propose solutions that overcome the challenges faced when constructing graph-based data integration learning systems. The first contribution is the definition of a decompositional graph node kernel, named Conjunctive Disjunctive Node Kernel (CDNK), which intends to measure the similarities between nodes of graphs. Differently of existing graph node kernels that only exploit the topologies of graphs, the proposed kernel also utilizes the available information on the graph nodes. In CDNK, first, the graph is transformed into a set of linked connected components in which we distinguish between “conjunctive” links whose endpoints are in the same connected components and “disjunctive” links that connect nodes located in different connected components. Then the similarity between any couple of nodes is measured by employing a particular graph kernel on two neighborhood subgraphs rooted as each node. Next, it integrates the side information by applying convolution of the discrete information with the real valued vectors associated to graph nodes. Empirical evaluation shows that the kernel presents better performance compared to state-of-the-art graph node kernels. The second contribution aims at dealing with the graph sparsity problem. When working with sparse graphs, i.e graphs with a high number of missing links, the available information is not efficient to learn effectively. An idea to overcome this problem is to use link enrichment to enrich information for graphs. However, the performance of a link enrichment strongly depends on the adopted link prediction method. Therefore, we propose an effective link prediction method (JNSL). In this method, first, each link is represented as a joint neighborhood subgraphs. Then link prediction is considered as a binary classification. We empirically show that the proposed link prediction outperforms various other methods. Besides, we also present a method to boost the performance of diffusion-based kernels, which are most popularly used, by coupling kernel methods with link enrichment. Experimental results prove that the performances of diffusion-based graph node kernels are considerably improved by using link enrichment. The last contribution proposes a general kernel-based framework for graph integration that we name Graph-one. Graph-one is designed to overcome the challenges when handling with graph integration. In particular, it is a scalable and efficient framework. Besides, it is able to deal with unbanlanced settings where the number of positive and negative instances are much different. Numerous variations of Graph-one are evaluated in disease gene prioritization context. The results from experiments illustrate the power of the proposed framework. Precisely, Graph-one shows better performance than various methods. Moreover, Graph-one with data integration gets higher results than it with any single data source. It presents the effectiveness of Graph-one in exploiting the complementary property of graph integration.

Kernel methods for large-scale graph-based heterogeneous biological data integration / Tran, Van Dinh. - (2018 Apr 03).