MASS SPECTROMETRY-BASED PROTEOMICS: A 3D APPROACH TO DATA HANDLING AND QUANTIFICATION

Nasso, Sara

This thesis describes the Ph.D. research project in Bioengineering for Computational Proteomics carried out over three years (January 2008 - January 2011). The work focused on the design and development of methods for the analysis of Quantitative Mass Spectrometry-based Proteomics data. The Introduction briefly outlines the main themes developed in the thesis and how the work is organized. It reviews the computational issues associated with both data handling and quantification, and introduces the solutions proposed in the following chapters. The first two chapters introduce the fields of Proteomics and Mass Spectrometry. The objective is to provide the reader with the information needed to understand Quantitative Mass Spectrometry-based Proteomics. In particular, Chapter 1 explains how proteomics was born as the -omics science of proteins. It then illustrates the main applications and goals of proteomics, which range from clinical and pharmaceutical research to systems biology. Chapter 2 presents the main technologies and instrumentation used in Mass Spectrometry-based proteomics. The most common experimental setups are reported; among them, the Liquid Chromatography-Mass Spectrometry (LC-MS) technique is explained thoroughly, since it is the principal technique for Quantitative Mass Spectrometry-based Proteomics. Chapter 3 presents the concepts needed to introduce the reader to the main topic of the Ph.D. research project, namely the development of bioinformatics tools for the handling and quantification of Mass Spectrometry-based Quantitative Proteomics data, with a focus on quantitative LC-MS data and their analysis. Indeed, LC-MS data are highly informative for quantification purposes, but challenging to parse. The data features that were pivotal for the design of the proposed solutions (i.e., the 3D structure of LC-MS data and the high-quality profile acquisition) are highlighted.
Chapter 4 describes the state of the art for both data handling and quantification, and illustrates the available standard data formats and software together with the related open challenges. Chapter 5 gives a technical description of the dataset used to carry out the analyses. It consists of LC-MS data from a labeled, controlled mixture of proteins with known quantification ratios, acquired in profile mode and in triplicate. This thesis presents two software solutions that address the handling and quantification of Quantitative Mass Spectrometry-based Proteomics data: mzRTree and 3DSpectra, respectively. Chapter 6 presents the solution proposed for the data handling issue. The proposal is a scalable 2D indexing approach implemented through an R-tree-based data structure, called mzRTree, which relies on a sparse matrix representation of the dataset; this representation is appropriate for LC-MS data and, more generally, for MS-based proteomics data. mzRTree allows efficient data access and storage, and enables a computationally sustainable analysis of profile MS data. Quantification is one of the most relevant problems in mass spectrometry-based proteomics, and Chapter 7 illustrates the solution proposed for it: 3DSpectra. It is an innovative quantification algorithm for labeled LC-MS profile data that exploits both the three-dimensionality of the data and the profile acquisition. 3DSpectra fits a 3D isotopic distribution model, shaped as a Gaussian Mixture Model that includes a noise component, to peptide data using the Expectation-Maximization approach. This model enables the software both to recognize the borders of the 3D isotopic distribution and to reject noise. 3DSpectra is a reliable and accurate quantification strategy for labeled LC-MS data, providing wide and reproducible proteome coverage.
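The idea of fitting a Gaussian mixture with an extra noise component by Expectation-Maximization can be illustrated with a minimal sketch. This is not the 3DSpectra implementation: it assumes diagonal-covariance Gaussians in the (m/z, retention time) plane, treats intensity as a per-point weight, and models noise as a uniform density over the bounding box of the data. The function name `fit_gmm_with_noise` and all of its parameters are hypothetical.

```python
import numpy as np


def fit_gmm_with_noise(points, weights, means_init, n_iter=50):
    """Weighted EM for K diagonal 2D Gaussians plus one uniform noise component.

    points     : (N, 2) array of (m/z, retention time) coordinates
    weights    : (N,) non-negative intensities used as point weights
    means_init : (K, 2) initial guesses for the peak centres
    Returns means, variances, mixing proportions (noise last) and the
    (N, K + 1) responsibility matrix (noise in the last column).
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.array(means_init, dtype=float)
    k = means.shape[0]

    # Start every Gaussian with the global per-axis variance.
    variances = np.tile(points.var(axis=0), (k, 1))
    mix = np.full(k + 1, 1.0 / (k + 1))

    # Assumption: noise is uniform over the bounding box of the data.
    span = points.max(axis=0) - points.min(axis=0)
    noise_density = 1.0 / np.prod(span)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        diff = points[:, None, :] - means[None, :, :]
        log_pdf = -0.5 * np.sum(
            diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2
        )
        dens = np.empty((len(points), k + 1))
        dens[:, :k] = mix[:k] * np.exp(log_pdf)
        dens[:, k] = mix[k] * noise_density
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M-step: intensity-weighted updates of the Gaussian parameters.
        wr = resp * weights[:, None]
        totals = np.maximum(wr.sum(axis=0), 1e-12)
        mix = totals / totals.sum()
        means = (wr[:, :k].T @ points) / totals[:k, None]
        diff = points[:, None, :] - means[None, :, :]
        variances = (
            np.einsum("nk,nkd->kd", wr[:, :k], diff ** 2) / totals[:k, None]
            + 1e-9  # small floor to avoid a degenerate zero variance
        )

    return means, variances, mix, resp
```

Under these assumptions, points whose last-column responsibility exceeds 0.5 would be treated as noise, and the remaining points would delimit the fitted isotopic peaks.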
The concluding section of the thesis discusses ongoing and future research work on the further development of both the mzRTree data structure and the 3DSpectra quantification software.
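The 2D range-query idea behind an R-tree-based index over sparse LC-MS data can also be sketched briefly. The example below is not mzRTree: it is a flat, one-level stand-in that packs sparse (m/z, retention time, intensity) triples into leaves with a sort-tile scheme and keeps one bounding box per leaf, whereas a real R-tree adds a hierarchy of boxes on top. The class name `BoundingBoxIndex` and its methods are hypothetical.

```python
import numpy as np


class BoundingBoxIndex:
    """One-level bounding-box index over sparse (m/z, RT, intensity) triples."""

    def __init__(self, mz, rt, intensity, leaf_size=64):
        mz = np.asarray(mz, dtype=float)
        rt = np.asarray(rt, dtype=float)
        intensity = np.asarray(intensity, dtype=float)
        n = len(mz)

        # Sort-tile packing: sort by RT, cut into slabs, sort each slab by m/z.
        slab = leaf_size * max(1, int(np.ceil(np.sqrt(n / leaf_size))))
        order = np.argsort(rt, kind="stable")
        for s in range(0, n, slab):
            block = order[s:s + slab]
            order[s:s + slab] = block[np.argsort(mz[block], kind="stable")]

        self.mz, self.rt, self.intensity = mz[order], rt[order], intensity[order]

        # Each leaf is a contiguous slice with its own bounding box.
        self.leaves = [(a, min(a + leaf_size, n)) for a in range(0, n, leaf_size)]
        self.boxes = np.array(
            [
                (self.mz[a:b].min(), self.mz[a:b].max(),
                 self.rt[a:b].min(), self.rt[a:b].max())
                for a, b in self.leaves
            ],
            dtype=float,
        ).reshape(-1, 4)

    def range_query(self, mz_lo, mz_hi, rt_lo, rt_hi):
        """Return an (M, 3) array of (m/z, RT, intensity) inside the rectangle."""
        # Keep only leaves whose bounding box intersects the query rectangle.
        hit = (
            (self.boxes[:, 0] <= mz_hi) & (self.boxes[:, 1] >= mz_lo)
            & (self.boxes[:, 2] <= rt_hi) & (self.boxes[:, 3] >= rt_lo)
        )
        out = []
        for i in np.flatnonzero(hit):
            a, b = self.leaves[i]
            mz, rt = self.mz[a:b], self.rt[a:b]
            keep = (mz >= mz_lo) & (mz <= mz_hi) & (rt >= rt_lo) & (rt <= rt_hi)
            out.append(np.column_stack((mz[keep], rt[keep], self.intensity[a:b][keep])))
        return np.vstack(out) if out else np.empty((0, 3))
```

Because only the leaves whose boxes intersect the query are scanned, a narrow m/z and retention-time window touches a small fraction of the stored points; this is the property that a full R-tree refines with nested boxes.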

This thesis describes the research project in Bioengineering for Computational Proteomics carried out during the three years of the doctoral program (January 2008 - January 2011). The research focused on the design and development of methods for the analysis of Mass Spectrometry-based Proteomics data. The Introduction briefly outlines the main themes addressed in the thesis, thereby providing the structure of the work. It then considers the two main problems associated with data analysis, namely data handling and quantification, and presents the solutions described in the following chapters. The first two chapters introduce the fields of Proteomics and Mass Spectrometry. The aim is to give the reader all the information needed to better understand Quantitative Mass Spectrometry-based Proteomics. Chapter 1 explains how Proteomics was born, namely as the protein complement of the genome. It then presents the main applications of Proteomics and its goals, ranging from clinical aspects and pharmaceutics to systems biology. Chapter 2 covers the technical aspects and presents the main technologies and instrumentation used in Mass Spectrometry-based Proteomics. The most common experimental setups are illustrated, with a particular focus on Liquid Chromatography-Mass Spectrometry (LC-MS), which is the principal technique for Quantitative Mass Spectrometry-based Proteomics experiments. Chapter 3 presents the fundamental concepts needed to introduce the reader to the main topic of the doctoral research project, namely the development of bioinformatics methods for the handling and quantification of Quantitative Mass Spectrometry-based Proteomics data, in particular for the analysis of quantitative LC-MS data.
Indeed, LC-MS data are highly informative for quantitative purposes, but extremely challenging to analyze. The experimental setups for LC-MS-based Quantitative Proteomics are therefore summarized, together with the data features that were decisive for the development of the proposed solutions (namely the 3D structure of LC-MS data and the high information content of profile data). Chapter 4 describes the state of the art for both data handling and quantification, along with the related open problems, which are addressed in the following chapters, where possible solutions are proposed. Chapter 5 is entirely devoted to the technical description of the data used to validate the proposed methodologies. These are LC-MS data generated from a mixture of labeled proteins with known quantification ratios. Three replicates are available for each experiment. In particular, this thesis presents two software tools for the handling and quantification of Quantitative Mass Spectrometry-based Proteomics data. Chapter 6 presents the solution proposed for the data handling problems. It is a scalable 2D indexing approach implemented through an R-tree-based data structure, called mzRTree, and it relies on representing the dataset as a sparse matrix, which suits LC-MS data and, more generally, Mass Spectrometry data. Specifically, mzRTree allows data to be accessed and stored efficiently, making a computationally sustainable analysis of profile data possible. As for quantification, Chapter 7 illustrates the proposed solution, 3DSpectra, an innovative quantification method that exploits both the three-dimensionality of LC-MS data and the high information content of profile data.
3DSpectra applies a 3D approach to recognizing the isotopic distribution of the peptide to be quantified, based on fitting a 3D Gaussian Mixture Model with the Expectation-Maximization algorithm. This model makes it possible to identify the borders of the signal to be quantified and to reject the noise present. 3DSpectra incorporates a reliable and accurate quantification strategy for labeled LC-MS data acquired in profile mode. Above all, 3DSpectra offers wide and reproducible proteome coverage at the quantification level. The concluding section of the thesis discusses ongoing and future work, which essentially concerns further development of both the data structure, mzRTree, and the quantification software, 3DSpectra.

MASS SPECTROMETRY-BASED PROTEOMICS: A 3D APPROACH TO DATA HANDLING AND QUANTIFICATION / Nasso, Sara. - (2011 Jan 30).