
A bioinformatic and computational approach to regulation of genome function: integrated analysis of genome organization, promoter sequences and gene expression / Coppe, Alessandro. - (2008 Jan 29).

A bioinformatic and computational approach to regulation of genome function: integrated analysis of genome organization, promoter sequences and gene expression.

Coppe, Alessandro
2008

Abstract

Although much is known about the regulation of gene expression in both prokaryotes and eukaryotes, this complex and fascinating mechanism remains to be fully elucidated. The relatively recent advent of high-throughput techniques for studying transcription has made available an invaluable amount of data that can be used for genome-wide analysis with bioinformatics approaches. These computational methods have become an integral part of biological research. The topics of this thesis concern the development and application of computational methodologies to better understand the basis of genomic gene expression regulation at different levels. The first level of investigation concerned the relationships among chromosomal structure, expression profile and functional characteristics, focusing on genomic organization and structure. For this task, the REEF (REgionally Enriched Features) software was developed to identify genomic regions enriched in specific features, such as a class or group of genes homogeneous for expression and/or functional characteristics. REEF can be used to detect density variations of specific features along the genome sequence, for example genomic regions significantly enriched in genes that are co-expressed, differentially expressed, or related to particular molecular functions. Local feature enrichment is calculated using a test statistic based on the hypergeometric distribution, applied genome-wide by sliding windows, and the false discovery rate is used to control for multiple testing. REEF has been applied to the study of the genomic distribution of tissue-specific genes and to the analysis of genes differentially expressed between different myeloid cell lines. These analyses identified clusters of tissue-specific genes in the human genome and positional enrichment of genes related to hemopoietic functional modules. The second level of investigation concerned the regulation of gene expression at the promoter level.
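The sliding-window enrichment test described for REEF can be illustrated with a minimal sketch. This is not the REEF code: the function names, the window and step sizes, the choice to test only windows containing at least one feature gene, and the use of the Benjamini-Hochberg procedure for the false discovery rate are all assumptions made for illustration. Genes on one chromosome are given as sorted start coordinates with a flag marking whether each carries the feature of interest.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the feature."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def enriched_windows(positions, flags, window, step, alpha=0.05):
    """Slide a window along one chromosome and report enriched windows.

    positions: sorted gene start coordinates (hypothetical input format)
    flags:     True where the gene carries the feature of interest
    Returns a list of ((start, end, k, n), q_value) with q_value <= alpha.
    """
    N, K = len(positions), sum(flags)
    windows, pvals = [], []
    w = positions[0]
    while w <= positions[-1]:
        inside = [f for p, f in zip(positions, flags) if w <= p < w + window]
        n, k = len(inside), sum(inside)
        if k > 0:  # only windows with at least one feature gene are tested
            windows.append((w, w + window, k, n))
            pvals.append(hypergeom_sf(k, N, K, n))
        w += step
    qvals = benjamini_hochberg(pvals)
    return [(win, q) for win, q in zip(windows, qvals) if q <= alpha]

# Toy chromosome: 100 evenly spaced genes, 5 adjacent ones carry the feature.
positions = [i * 10 for i in range(100)]
flags = [40 <= i < 45 for i in range(100)]
hits = enriched_windows(positions, flags, window=50, step=10)
best = min(hits, key=lambda h: h[1])
print(best)
```

In this toy example the window with the lowest q-value is the one that exactly covers the five clustered genes. A real analysis would also need to choose window sizes appropriate to gene density and handle each chromosome separately.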
Unknown transcription factor binding sites might be detected by searching for shared sequence elements in the upstream regulatory regions of genes with a common biological function and/or a similar expression profile. Indeed, genes with similar expression are frequently co-regulated, and genes with related functions are often similarly expressed. New methodologies for the identification of regulatory motifs in human promoters were developed and tested. Since a drawback of this approach is the exceedingly high number of results, biological knowledge was used both before and after automated pattern discovery to define a “sheltered environment” that enhances the specificity of the computational analysis. The COOP (Clustering of Overlapping Patterns) software for the extraction of sequence motifs was developed and used to analyze the genomic sequences 1 kb upstream of 91 retina-specific genes, identifying a set of putative regulatory motifs that occur frequently in retina promoter sequences. Most of these motifs are located in the proximal portion of promoters and tend to be less variable in their central region than in their lateral regions, and some are similar to known regulatory sequences. The performance of COOP was further evaluated by simulation and by applying it to a standard positive control dataset proposed by Tompa and colleagues for the systematic evaluation and comparison of pattern discovery software. A web tool for the prediction of functional elements in promoter sequences, MOST (MOtif Searching web Tool), was applied to different datasets under various testing conditions in order to study the influence of specific search parameters on the results.
Two groups of promoter sequences containing known regulatory signals were used as positive control datasets: the public yeast benchmark dataset of Tompa and colleagues and a custom dataset of 37 human promoter sequences, subgroups of which contained instances of one of nine different signals. The method performed well on these benchmark datasets. Building on the concepts behind COOP, a more rigorous methodology was developed to identify surprising, putatively regulatory motifs by comparing their frequency in the promoter sequences of co-expressed genes with that in a background set of sequences representative of the whole set of human gene promoters. Promoter sequences are divided into overlapping regions, which are considered independently, in order to identify positional bias in the arrangement of transcription factor binding sites along promoters. Because of the genome-wide nature of this approach, a new web tool for the automatic identification and retrieval of large numbers of human promoters was also developed. This motif discovery methodology has been used to investigate the structure of the promoters of genes crucial for myeloid differentiation.
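The comparison of motif frequency between co-expressed promoters and a background set, region by region, can be sketched as follows. The abstract does not say which statistic the thesis uses, so this sketch makes several assumptions: motifs are plain k-mers, each motif is scored by the number of sequences that contain it in a given region, and surprise is measured with a binomial upper-tail probability whose success rate is estimated from the background with a pseudocount. All function names and parameters are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def split_regions(length, size, step):
    """Overlapping (start, end) regions; the tail is left uncovered when
    (length - size) is not a multiple of step."""
    return [(s, s + size) for s in range(0, max(length - size, 0) + 1, step)]

def kmers_in(seq, k):
    """Set of distinct k-mers present in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def binom_sf(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))

def surprising_motifs(foreground, background, k, size, step, alpha=0.01):
    """Report k-mers over-represented in foreground promoters, per region.

    foreground, background: lists of equal-length promoter sequences,
    aligned so that the same index is the same distance from the gene start.
    Returns tuples (start, end, motif, fg_count, bg_count, p_value).
    """
    hits = []
    for start, end in split_regions(len(foreground[0]), size, step):
        # Each region is analysed independently of the others.
        fg_sets = [kmers_in(s[start:end], k) for s in foreground]
        bg_sets = [kmers_in(s[start:end], k) for s in background]
        for motif in sorted(set().union(*fg_sets)):
            x = sum(motif in s for s in fg_sets)
            b = sum(motif in s for s in bg_sets)
            # Background occurrence rate with a pseudocount (assumption).
            p_bg = (b + 1) / (len(bg_sets) + 2)
            p = binom_sf(x, len(fg_sets), p_bg)
            if p <= alpha:
                hits.append((start, end, motif, x, b, p))
    return hits

# Toy data: a motif shared by all foreground promoters but absent from the
# background, and a poly-A stretch that is common to both sets.
foreground = ["TGACGT" + "AAAAAA"] * 6
background = ["CCCCCC" + "AAAAAA"] * 20
hits = surprising_motifs(foreground, background, k=6, size=8, step=4)
for hit in hits:
    print(hit)
```

In the toy data, TGACGT is reported in the first region because it occurs in every foreground sequence and in no background sequence, while AAAAAA is not reported because it is equally common in the background. Scoring each region separately is what allows a positional bias to show up.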
29 Jan 2008
gene expression, functional genomics, regulatory motifs, computational biology, bioinformatics
Type: Doctoral thesis
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3426395