The computation of statistical indexes such as frequency counts, expected probabilities, and over/under-representation scores is a routine task shared among several biological problems. In fact, they often constitute the core operation of pattern discovery algorithms used to extract DNA or proteins motifs from a sequence or a set of related sequences. We are interested in the problem of computing indexes related to co-occurrences of elements in a sequence. This can be applied to discovery regulatory elements that can be modeled as dyad motifs, i.e. two conserved components that co-occur at some distance d. Exact methods and heuristics suffer from limitations arising respectively from the exponential time and space complexity, and from the lack of guarantee of finding the optimal solution. An alternative approach to the problem is the compact or conservative approach. This is based on the notion of maximality to partition the search space in a set of classes, and evaluate only one representative for each class. If we consider the space of all the strings in a sequence, the classes can be identified by all strings that share the same set of starting positions. The representative maximal element is the longest string in the class. The suffix tree data structure naturally detects such representatives as the strings spelled out by any path from the root to each of its nodes. When counting co-occurrences measuring head-to-head distance up to a value d, it suffices to compute co-occurrences count only between maximal representatives. Any pair that is left out is associated with a maximal pair that has been counted. Since there are O(n2) words in a sequence of length n, but only O(n) maximal representative, computing the co-occurrence count require only O(n2) time and space rather than O(n4) [Apostolico et al, 2004]. Tail-to-head distance between the two components might be more natural to consider in real application. To compute tail-to-head distance we first compute co-occurrence count between maximal for all possible head-to-head distances. This requires O(n3) time and space. Then we can compute tail-to-head distance for any pair of words in constant time. In order to assign a significance score to each dyad motif, besides efficient counting it is necessary to study the behavior of the expectation of pairs among each class. The results of our analysis showed that either considering a background probability derived from the input strings, or from an external source, the expectation has a constant or decreasing behavior when the length of the strings increases. This implies that the score associated with the maximal element is the maximum score that can be achieved. Experiments on dyad motifs discovery where the dataset consists in 8 gene families that are regulated by zinc-bicluster transcription factors, proved our method to be effective and much faster than the exhaustive approach [VanHelden et al, 2000].

Efficient Discovery of Significant Co-occurrences with an Application to Dyad Motif Finding(abstract)

PIZZI, CINZIA;
2007

Abstract

The computation of statistical indexes such as frequency counts, expected probabilities, and over/under-representation scores is a routine task shared among several biological problems. In fact, they often constitute the core operation of pattern discovery algorithms used to extract DNA or proteins motifs from a sequence or a set of related sequences. We are interested in the problem of computing indexes related to co-occurrences of elements in a sequence. This can be applied to discovery regulatory elements that can be modeled as dyad motifs, i.e. two conserved components that co-occur at some distance d. Exact methods and heuristics suffer from limitations arising respectively from the exponential time and space complexity, and from the lack of guarantee of finding the optimal solution. An alternative approach to the problem is the compact or conservative approach. This is based on the notion of maximality to partition the search space in a set of classes, and evaluate only one representative for each class. If we consider the space of all the strings in a sequence, the classes can be identified by all strings that share the same set of starting positions. The representative maximal element is the longest string in the class. The suffix tree data structure naturally detects such representatives as the strings spelled out by any path from the root to each of its nodes. When counting co-occurrences measuring head-to-head distance up to a value d, it suffices to compute co-occurrences count only between maximal representatives. Any pair that is left out is associated with a maximal pair that has been counted. Since there are O(n2) words in a sequence of length n, but only O(n) maximal representative, computing the co-occurrence count require only O(n2) time and space rather than O(n4) [Apostolico et al, 2004]. Tail-to-head distance between the two components might be more natural to consider in real application. To compute tail-to-head distance we first compute co-occurrence count between maximal for all possible head-to-head distances. This requires O(n3) time and space. Then we can compute tail-to-head distance for any pair of words in constant time. In order to assign a significance score to each dyad motif, besides efficient counting it is necessary to study the behavior of the expectation of pairs among each class. The results of our analysis showed that either considering a background probability derived from the input strings, or from an external source, the expectation has a constant or decreasing behavior when the length of the strings increases. This implies that the score associated with the maximal element is the maximum score that can be achieved. Experiments on dyad motifs discovery where the dataset consists in 8 gene families that are regulated by zinc-bicluster transcription factors, proved our method to be effective and much faster than the exhaustive approach [VanHelden et al, 2000].
2007
4th Annual RECOMB Satellite on Regulatory Genomics
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/179632
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact