A web-based platform to retrieve user-ranked data from human exome/genome sequencing projects.

FORCATO, CLAUDIO; VITULO, NICOLA; VEZZI, ALESSANDRO; VALLE, GIORGIO
2012

Abstract

Genome and exome sequencing projects produce huge amounts of data, which in turn can yield extensive catalogues of human genetic variation. However, identifying which genetic variants are implicated in the onset and progression of human diseases remains a difficult task. New bioinformatic tools are required to efficiently extract a small number of candidate variants from the large amounts of DNA sequencing data produced. Here we present the development of a platform designed to manage and retrieve data from human exome/genome sequencing projects. The platform integrates heterogeneous information to help associate variants with the pathology or phenotype under study. This information can relate to gene features (Gene Ontology, Disease Ontology, OMIM, InterPro annotations) or to genomic context, or it can describe the effects of variants on the coding sequence (dbSNP, degree of deleteriousness) and their confidence in terms of depth of sequence coverage and calling score. The platform is accessible through a web interface where the user can upload one or more files containing the variants in VCF format. SNPs and microindels are automatically mapped onto the genome and stored in a relational database together with their possible effects on the corresponding transcripts and proteins. A powerful and flexible query system then allows the user to explore the data by applying different criteria related to the heterogeneous information stored in the database. The results of a query are displayed as a ranked list ordered by how many of the imposed criteria are satisfied. The query and ranking systems therefore allow the user to filter the information at different levels and to directly assess the significance of the results. The web platform and the query system are based on a scalable and easily configurable XML-based language. This makes it possible to accommodate the continuous increase in data volume and heterogeneity, and the consequent updates to the database structure, without any modification of the software code.
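The ranking idea described in the abstract (ordering variants by how many of the imposed criteria they satisfy) can be illustrated with a minimal sketch. This is not the platform's implementation, which the abstract describes as built on a relational database and an XML-based query language. The names `Variant`, `parse_vcf_line`, `rank_variants`, and `EXAMPLE_CRITERIA`, as well as the thresholds used, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Variant:
    """The first eight fixed columns of a VCF data line (hypothetical container)."""
    chrom: str
    pos: int
    id: str
    ref: str
    alt: str
    qual: float
    info: Dict[str, str]


def parse_vcf_line(line: str) -> Variant:
    """Parse one tab-separated VCF data line (not a '#' header line)."""
    chrom, pos, vid, ref, alt, qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    info_map: Dict[str, str] = {}
    for field in info.split(";"):
        if field and field != ".":
            key, _, value = field.partition("=")
            info_map[key] = value
    # A missing QUAL value is written as "." in VCF; treat it as 0.0 here.
    quality = float(qual) if qual != "." else 0.0
    return Variant(chrom, int(pos), vid, ref, alt, quality, info_map)


Criterion = Callable[[Variant], bool]


def rank_variants(
    variants: List[Variant], criteria: List[Criterion]
) -> List[Tuple[int, Variant]]:
    """Return (score, variant) pairs, highest score first.

    The score is the number of criteria the variant satisfies.
    Ties keep their input order because Python's sort is stable.
    """
    scored = [(sum(1 for c in criteria if c(v)), v) for v in variants]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Illustrative criteria only; the thresholds are arbitrary assumptions.
EXAMPLE_CRITERIA: List[Criterion] = [
    lambda v: v.qual >= 30.0,                    # calling score
    lambda v: int(v.info.get("DP", "0")) >= 20,  # depth of coverage
    lambda v: v.id == ".",                       # no identifier in the ID column
]
```

Because every variant receives a score instead of being discarded when a single criterion fails, a variant that meets most criteria still appears in the list, just lower down. That property is what lets a ranked list complement strict filtering.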
Personal Genomes & Medical Genomics
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3042928