Computational Analysis and Annotation of Proteome Data: Sequence, Structure, Function and Interactions

Di Domenico, Tomás

With the advent of modern sequencing technologies, the amount of biological data available has begun to challenge our ability to process it. The development of new tools and methods has become essential for producing results from such a vast amount of information. This thesis focuses on the development of such computational tools and methods for the study of protein data. I first present the work done towards the understanding of intrinsic protein disorder. Through the development of novel disorder predictors, we were able to expand the available data sources to cover any protein of known sequence. By storing these predicted annotations together with data from other sources, we created MobiDB, a resource that provides a comprehensive view of the disorder annotations available for a protein of interest, covering all sequences in the UniProt database. Based on observations obtained from this resource, we then created a data analysis workflow with the goal of furthering our understanding of intrinsic protein disorder. The second part focuses on tandem repeat proteins. The RAPHAEL method was developed to assist in the identification of tandem repeat protein structures from PDB files. The identified repeat structures were then manually classified according to a formal classification schema and published as part of the RepeatsDB database. Finally, I describe the development of network-based tools for the analysis of protein data. The first, RING, allows the user to visualise and study the structure of a protein as a network of nodes linked by physico-chemical properties. The second, PANADA, enables the user to create protein similarity networks and to assess the transferability of functional annotations between clusters of proteins.

With the advent of modern sequencing technologies, the amount of available biological data has begun to challenge our ability to process it. It has therefore become essential to develop new tools and techniques capable of producing results from large volumes of information. This thesis focuses on the development of such computational tools and methods for the study of protein data. The work done towards understanding intrinsically disordered proteins is presented first. Through the development of new disorder predictors, we were able to exploit the currently available data sources to annotate any protein of known sequence. By storing these predictions together with data from other sources, MobiDB was created. This resource provides a comprehensive view of the disorder annotations available for any protein of interest in the UniProt database. Based on observations obtained from this tool, a data analysis workflow was then created with the aim of deepening our understanding of intrinsically disordered proteins. The second part of the thesis focuses on repeat proteins. The RAPHAEL method was developed to assist in identifying repeat protein structures within PDB files. The structures selected by this tool were then manually catalogued using a formal classification schema and published as part of the RepeatsDB database. Finally, the development of graph-based tools for the analysis of protein data is described. RING allows the user to visualise and study the structure of a protein as a network of nodes connected to one another by physico-chemical properties. The second method, PANADA, allows the user to create protein similarity networks and to assess the transferability of functional annotations between different clusters.

Computational Analysis and Annotation of Proteome Data: Sequence, Structure, Function and Interactions / Di Domenico, Tomás. - (2014 Jan 30).