Optimal Amnesic Probabilistic Automata, or How to Learn and Classify Proteins in Linear Time and Space
Apostolico, Alberto
2000
Abstract
Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. In [8], much more compact, tree-shaped variants of probabilistic automata are built under the assumption of an underlying Markov process of variable memory length. In [3, 4], these variants, called Probabilistic Suffix Trees (PSTs), were successfully applied to the learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered as a candidate state of the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost Θ(m²) time in the worst case. The main contribution of this paper is to introduce automata equivalent to PSTs but with the following properties: learning the automaton takes O(n) time, and prediction of a string of m symbols by the automaton takes O(m) time. Along the way, the paper presents an evolving learning scheme and addresses notions of empirical probability and their efficient computation, possibly a by-product of more general interest.
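The complexity claims above hinge on how a PST scores a query: at each position, the predictor must find the longest stored context ending there before it can read off the next-symbol probability. The following minimal Python sketch illustrates that scoring loop; it is not code from the paper, and the names (the toy context table pst, the cap L_MAX, the function score) are hypothetical. The per-position context search in the inner loop is what makes naive prediction superlinear, the Θ(m²) worst case that the paper's equivalent automata eliminate.

import math

# Toy PST over the alphabet {a, b}: each key is a context (the suffix of
# text preceding the current position); each value is a next-symbol
# probability distribution. Names and values are illustrative only.
pst = {
    "":   {"a": 0.5, "b": 0.5},
    "a":  {"a": 0.7, "b": 0.3},
    "ba": {"a": 0.2, "b": 0.8},
}

L_MAX = 2  # longest context retained in the tree (the paper's L)

def score(query: str) -> float:
    """Log-likelihood of `query` under the toy PST.

    For each of the m positions we search for the longest suffix of the
    preceding text that is a context in the tree; this inner search is
    the source of the superlinear prediction cost in the naive scheme.
    """
    logp = 0.0
    for i, c in enumerate(query):
        # Longest suffix of query[:i] (capped at L_MAX) present in the tree;
        # the empty context "" always matches, so the loop terminates.
        for k in range(min(i, L_MAX), -1, -1):
            ctx = query[i - k:i]
            if ctx in pst:
                logp += math.log(pst[ctx][c])
                break
    return logp

print(score("abab"))  # sums log P(a|""), log P(b|"a"), log P(a|""), log P(b|"ba")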