Development and assessment of bioinformatics methods for personalized medicine

Reggiani, Francesco

The human genome is a source of information for researchers who study complex diseases, with the aim of better understanding these pathologies and developing new therapeutic strategies. Since the beginning of the current century, a growing number of DNA sequencing technologies have emerged, generally referred to as Next Generation Sequencing (NGS) technologies. NGS has gradually decreased the cost of sequencing a human genome to around US$1000, enabling the use of these technologies for clinical and research purposes, such as genome-wide association studies (GWAS). GWAS have revealed disease-associated loci, in particular variants that could be used to evaluate an individual's risk of developing a disease. Unfortunately, different sources of error can impair the interpretation and use of NGS data. On the one hand, noise related to the DNA sequencing process and read alignment errors can lead to false positive calls or artifacts. On the other hand, variants can be poor predictors of the manifestation of their associated disease. The challenge of genomic data interpretation has driven research towards the development of methods for the analysis and interpretation of genomic variation, with the ultimate goal of predicting the probability that a patient will develop a given disease. A fair evaluation of these tools is essential to understand their applicability in clinical practice. The Critical Assessment of Genome Interpretation (CAGI) was established with the aim of defining the current state of the art in methods for predicting the impact of genomic changes at the molecular and phenotype levels. CAGI is a community-driven experiment in which different prediction methods, developed by a set of invited groups, are evaluated on a common dataset.
Unfortunately, no common guidelines were given for evaluating the tools presented in CAGI experiments. This has made comparisons between different CAGI experiments cumbersome, since different mathematical indices and scripts have been used to evaluate the methods involved. My PhD project focused on the development of software for the assessment of machine learning methods in regression and multiple-phenotype challenges. This tool is based on state-of-the-art assessment principles derived from the literature or from previous CAGI experiments. The software is available as an R package and has been used to repeat existing assessments or perform new ones on a wide range of CAGI experiments. The knowledge acquired during the development of this project was used to evaluate two CAGI 5 challenges: Pericentriolar Material 1 (PCM1) and the Intellectual Disability (ID) panel. The experience gained through all of the work mentioned above guided the improvement and assessment of a machine learning method. In particular, I developed software for the prediction of cholesterol levels based on genotype data, and then tested the reliability of this method. This tool was a milestone in a project funded by the Italian Ministry of Health.
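As an illustration only: the abstract does not say which indices the R package computes, so the sketch below assumes two common regression measures, the Pearson correlation coefficient and the root-mean-square error (RMSE). It is written in Python rather than R, and every function name, submission label, and value in it is hypothetical. It shows the general idea of scoring every submission against the same observed values with the same indices, so that results stay comparable across challenges.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def assess(observed: Sequence[float],
           submissions: Dict[str, Sequence[float]]) -> List[dict]:
    """Score each submission with the same indices, ranked by Pearson r."""
    rows = [
        {
            "submission": name,
            "pearson": pearson(observed, predicted),
            "rmse": rmse(observed, predicted),
        }
        for name, predicted in submissions.items()
    ]

    def sort_key(row: dict) -> float:
        # Undefined correlations are ranked last.
        r = row["pearson"]
        return -math.inf if math.isnan(r) else r

    return sorted(rows, key=sort_key, reverse=True)


# Hypothetical observed phenotype values and two hypothetical submissions.
observed = [1.0, 2.0, 3.0, 4.0]
submissions = {
    "group_A": [1.1, 1.9, 3.2, 3.8],
    "group_B": [4.0, 3.0, 2.0, 1.0],
}
ranking = assess(observed, submissions)
for row in ranking:
    print(row["submission"], round(row["pearson"], 3), round(row["rmse"], 3))
```

Ranking by a single index is a simplification chosen for brevity; an assessment of this kind would normally weigh several indices together.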

The human genome is a rich source of information for researchers who study complex diseases. The goal of this kind of research is to reach a better understanding of these diseases and thus to develop new therapeutic strategies for treating affected patients. Since the beginning of this century, a growing number of DNA sequencing technologies have been developed, known as "Next Generation Sequencing" (NGS) technologies. NGS technologies have gradually reduced the cost of sequencing a human genome to about 1000 dollars, which has enabled their use in clinical practice and in research, in particular in genome-wide association studies (GWAS). These studies have brought to light the association of certain variants with certain diseases or complex traits. These variants could be used to assess the risk that an individual will develop a particular disease. Unfortunately, several sources of error can hinder the use and interpretation of genomic data: on the one hand, there is the noise tied to the sequencing process and read alignment errors. On the other hand, SNPs cannot always be used reliably to predict the onset of the disease with which they have been associated. The Critical Assessment of Genome Interpretation (CAGI) was organized with the aim of defining the state of the art in methods that estimate the effect of genetic variations at the molecular or phenotypic level. Over the years, CAGI has given rise to several competitions in which different research groups have tested their prediction methods on different shared datasets. The absence of general guidelines on how to evaluate predictor performance has made it difficult to compare methods developed in different editions of CAGI.
In this context, the PhD project focused on the development of software for the evaluation of machine learning methods based on regression or on the prediction of multiple phenotypes. This tool is founded on performance analysis criteria derived from the literature and from previous CAGI experiments. The software was developed in R and used to repeat, or perform from scratch, the evaluation of predictor quality in a large number of CAGI experiments. The knowledge acquired during the development of this project was used to evaluate two CAGI 5 challenges: Pericentriolar Material 1 (PCM1) and the Intellectual Disability (ID) panel. The experience gained from completing the work listed above guided the development and performance improvement of a predictive method. In particular, software was developed for the prediction of cholesterol levels based on genotype data, and its validity was tested with state-of-the-art mathematical criteria. This tool was the cornerstone of a project funded by the Italian Ministry of Health.

Development and assessment of bioinformatics methods for personalized medicine / Reggiani, Francesco. - (2019 Nov 19).