Automated Text Categorization: The Two-Dimensional Probabilistic Model

DI NUNZIO, GIORGIO MARIA

doi:10.1007/978-3-540-75134-2_5

In Automated Text Categorization (ATC), a general inductive process automatically builds a classifier for the categories of interest by observing the properties of a set of pre-classified documents; from these properties, it learns the characteristics that a new, unseen document should have in order to be categorized under a specific category. Probabilistic models, such as Naive Bayes (NB) models, achieve performance comparable to that of more sophisticated models and are very efficient.

We present a probabilistic model named the \emph{Two-Dimensional Probabilistic Model} (2DPM), which starts from hypotheses different from those of the NB models: instead of independent events, terms are seen as disjoint events, and documents are represented as the union of these events. The probability measures defined in this model work in such a way that reducing the vocabulary of terms to lower the complexity of the problem is ultimately not necessary. Moreover, the model defines a direct relationship between the probability of a document given a category of interest and a point in a two-dimensional space. In this light, it is possible to plot entire collections of documents on a Cartesian plane and to design algorithms that categorize documents directly on this two-dimensional representation. This graphical representation has also provided insights for the development of the theoretical aspects of the 2DPM. Experiments on traditional test collections for ATC show that the 2DPM outperforms the multinomial NB model, with statistically significant differences.
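
The difference in hypotheses can be sketched as follows. The notation is assumed here for illustration and is not taken from the text above: $d$ is a document, $c$ a category, $\bar{c}$ its complement, and $t \in d$ ranges over the terms of $d$. Under independence, an NB-style document likelihood is a product over terms; if terms are instead disjoint events and a document is their union, additivity of probability gives a sum.

\[
P_{\mathrm{NB}}(d \mid c) \;\propto\; \prod_{t \in d} P(t \mid c),
\qquad
P_{\mathrm{2DPM}}(d \mid c) \;=\; P\Bigl(\,\bigcup_{t \in d} t \Bigm| c\Bigr) \;=\; \sum_{t \in d} P(t \mid c).
\]

One plausible way to obtain a point on the plane, again an assumption made only for this sketch, is to pair the value for the category with the value for its complement, $\bigl(P(d \mid c),\, P(d \mid \bar{c})\bigr)$, so that a categorization rule becomes a geometric decision on where the point falls.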