Background: With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques can not be applied, for example if two genomes do not share the same set of genes, or they are not alignable due to low sequence similarity or to their lengths. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free. Method: In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in field of string algorithms that captures the statistics of common words between two sequences, can be derived from a small set of ``independent'' subwords, namely the irredundant common subwords. We define a distance-like measures based on these subwords, such that each region of the genomes contributes only once, by avoiding to count subwords a multiple number of times. This filter will discard the subwords that occur in regions covered by other more significant subwords. Results The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and it can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state of the art methods. Moreover we show that the accuracy of UA is achieved with a very small number of subwords, that in some cases carry meaningful biological information.

Alignment-Free Phylogeny of Whole Genomes using Underlying Subwords

COMIN, MATTEO;
2012

Abstract

Background: With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques can not be applied, for example if two genomes do not share the same set of genes, or they are not alignable due to low sequence similarity or to their lengths. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free. Method: In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in field of string algorithms that captures the statistics of common words between two sequences, can be derived from a small set of ``independent'' subwords, namely the irredundant common subwords. We define a distance-like measures based on these subwords, such that each region of the genomes contributes only once, by avoiding to count subwords a multiple number of times. This filter will discard the subwords that occur in regions covered by other more significant subwords. Results The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and it can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state of the art methods. Moreover we show that the accuracy of UA is achieved with a very small number of subwords, that in some cases carry meaningful biological information.
2012
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/2533240
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 47
  • ???jsp.display-item.citation.isi??? 41
social impact