We study a problem related to the extraction of over-represented words from a given source text x, of length n. The words are allowed to occur with k mismatches, and x is produced by a source over an alphabet Σ according to a Markov chain of order p. We propose an online algorithm to compute the expected number of occurrences of a word y of length m in O(mk |Σ|^(p + 1)). We also propose an offline algorithm to compute the probability of any word that occurs in the text in O(k|Σ|^2) after O(nk |Σ|^(p + 1)) pre-processing. This algorithm allows us to compute the expectation for all the words in a text of length n in O(kn^2|Σ|^2 + nk |Σ|^(p + 1)), rather than in O(n^3 |Σ|^(p + 1)) that can be obtained with other methods. Although this study was motivated by the motif discovery problem in bioinformatics, the results find their applications in any other domain involving combinatorics on words.

Expectation of Strings with Mismatches Under Markov Chain Distribution

PIZZI, CINZIA;
2009

Abstract

We study a problem related to the extraction of over-represented words from a given source text x, of length n. The words are allowed to occur with k mismatches, and x is produced by a source over an alphabet Σ according to a Markov chain of order p. We propose an online algorithm to compute the expected number of occurrences of a word y of length m in O(mk |Σ|^(p + 1)). We also propose an offline algorithm to compute the probability of any word that occurs in the text in O(k|Σ|^2) after O(nk |Σ|^(p + 1)) pre-processing. This algorithm allows us to compute the expectation for all the words in a text of length n in O(kn^2|Σ|^2 + nk |Σ|^(p + 1)), rather than in O(n^3 |Σ|^(p + 1)) that can be obtained with other methods. Although this study was motivated by the motif discovery problem in bioinformatics, the results find their applications in any other domain involving combinatorics on words.
2009
Proceedings of the 16th Symposium on String Processing and Information Retrieval (SPIRE 2009)
9783642037832
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/2382865
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 4
  • ???jsp.display-item.citation.isi??? 2
social impact