Automated Text Categorization: The Two-Dimensional Probabilistic Model

DI NUNZIO, GIORGIO MARIA
2008

Abstract

In Automated Text Categorization (ATC), a general inductive process automatically builds a classifier for a set of categories by observing the properties of pre-classified documents; from these properties, it learns the characteristics that a new, unseen document should have in order to be assigned to a specific category. Probabilistic models such as the Naive Bayes (NB) models achieve performance comparable to that of more sophisticated models while remaining very efficient. We present a probabilistic model, the Two-Dimensional Probabilistic Model (2DPM), which starts from hypotheses different from those of the NB models: instead of independent events, terms are seen as disjoint events, and documents are represented as the union of these events. The probability measures defined in this model work in such a way that reducing the vocabulary of terms to lower the complexity of the problem is ultimately not necessary. Moreover, the model defines a direct relationship between the probability of a document given a category of interest and a point in a two-dimensional space. It is therefore possible to plot entire collections of documents on a Cartesian plane and to design algorithms that categorize documents directly on this two-dimensional representation. This graphical representation has also provided insights that guided the development of the theoretical aspects of the 2DPM. Experiments on traditional ATC test collections show that the 2DPM outperforms the multinomial NB model to a statistically significant degree.
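The idea of treating terms as disjoint events and mapping each document to a point on a plane can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the chapter: the relative-frequency estimator for term probabilities, the choice of coordinates (probability of the document given the category on one axis, given its complement on the other), the diagonal decision rule, and the toy data. The only step that follows directly from the abstract is that, for disjoint events, the probability of their union is the sum of their probabilities.

```python
from collections import Counter


def term_probabilities(docs):
    """Estimate P(term | class) as the share of term occurrences in `docs`.

    Assumed estimator for illustration; the probabilities sum to 1 over the
    terms seen in this class.
    """
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def document_point(doc, p_in, p_out):
    """Map a document to a point (x, y) on the plane.

    The document is the union of its distinct terms; because the terms are
    treated as disjoint events, the probability of the union is a plain sum.
    x uses the probabilities estimated on the category, y those estimated on
    its complement (hypothetical choice of axes).
    """
    terms = set(doc)
    x = sum(p_in.get(t, 0.0) for t in terms)
    y = sum(p_out.get(t, 0.0) for t in terms)
    return x, y


def in_category(doc, p_in, p_out):
    """Assumed decision rule: accept the document if it lies below the diagonal y = x."""
    x, y = document_point(doc, p_in, p_out)
    return x > y


# Hypothetical toy training data: tokenized documents in and out of a category.
in_docs = [["wheat", "grain", "export"], ["grain", "price"]]
out_docs = [["bank", "rate", "price"], ["bank", "loan"]]

p_in = term_probabilities(in_docs)
p_out = term_probabilities(out_docs)

point = document_point(["grain", "price", "rise"], p_in, p_out)
print(point, in_category(["grain", "price", "rise"], p_in, p_out))
```

Because every document becomes a pair of coordinates, a whole collection can be drawn as a scatterplot, and a classifier reduces to choosing a boundary on that plot; the diagonal used here is only the simplest possible choice.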
Published in: Information Access through Search Engines and Digital Libraries
ISBN: 9783540751335
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/2271010