Verbumculus and the discovery of unusual words

Apostolico, Alberto; Gong, Fc; Lonardi, S.

doi:10.1007/BF02944783

Measures relating word frequencies and expectations have long been of interest in bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures has become conceivable, yet it poses a significant computational burden even when limited to words of bounded maximum length. In addition, the display of the huge tables that may result from these counts poses practical problems of visualization and inference. Verbumculus is a suite of software tools for the fast and efficient detection of over- or under-represented words in nucleotide sequences. The inner core of Verbumculus rests on subtly interwoven properties of statistics, pattern matching and combinatorics on words, which make it possible to limit drastically, and a priori, the set of over- or under-represented candidate words of all lengths in a given sequence, thereby making it more feasible both to detect and to visualize such words in a fast and practically useful way. This paper first describes the facility and then reports experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements in the upstream regions of a set of yeast genes. The Verbumculus software is accessible at http://www.cs.ucr.edu/~stelo/Verbumculus/ or http://wwwdbl.dei.unipd.it/Verbumculus/
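
To make the notion of a measure relating word frequencies and expectations concrete, the following is a minimal sketch of a naive baseline, not the method used by Verbumculus. It assumes an i.i.d. (Bernoulli) model with symbol probabilities estimated from the sequence itself, scores only words of one fixed length k, and uses a binomial variance that ignores word self-overlap. The function name word_scores and these modeling choices are assumptions made for illustration only.

```python
from collections import Counter
from math import prod, sqrt


def word_scores(seq, k):
    """Score each observed k-mer by (observed - expected) / sqrt(variance).

    Assumptions (illustrative, not taken from the paper):
    - symbols are i.i.d. with probabilities estimated from seq;
    - the variance is the binomial n*p*(1-p), ignoring self-overlap.
    """
    n = len(seq) - k + 1  # number of k-mer positions
    if n <= 0:
        return {}
    symbol_freq = Counter(seq)
    total = len(seq)
    observed = Counter(seq[i:i + k] for i in range(n))
    scores = {}
    for word, obs in observed.items():
        # Probability of the word under the i.i.d. model.
        p = prod(symbol_freq[c] / total for c in word)
        expected = n * p
        variance = n * p * (1 - p)
        scores[word] = (obs - expected) / sqrt(variance) if variance > 0 else 0.0
    return scores


if __name__ == "__main__":
    for word, z in sorted(word_scores("ACACACAC", 2).items()):
        print(word, round(z, 3))
```

A large positive score suggests over-representation and a large negative score under-representation. This sketch scores only words that actually occur, so words that are absent from the sequence would need to be enumerated separately, and it enumerates every word of a single length, which is precisely the exhaustive approach whose cost the abstract describes.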