Using Scatterplots to Understand and Improve Probabilistic Models for Text Categorization and Retrieval

Di Nunzio, Giorgio Maria

doi:10.1016/j.ijar.2009.01.002

The two-dimensional representation of documents, which places each document on a Cartesian plane, has proved to be a valid visualization tool for \ac{ATC}: it helps in understanding the relationships between categories of textual documents, and it helps users to visually audit the classifier and identify suspicious training data. In this paper, we analyze a specific use of this visualization approach in the case of the \ac{NB} model for text classification and the \ac{BIM} for text retrieval. For text categorization, the equation for the classification decision is reformulated so that each coordinate of a document is the sum of two addends: a variable component $\mathrm{P}(d | c_i)$, and a constant component $\mathrm{P}(c_i)$, the prior of the category. When the documents are plotted on the Cartesian plane according to this formulation, they can be seen to be shifted by a constant amount along the x-axis and the y-axis. This shifting effect is more or less evident depending on which \ac{NB} model, Bernoulli or multinomial, is chosen. For text retrieval, the same reformulation can be applied in the case of the \ac{BIM}. The visualization helps to understand the decisions that determine how the documents are ranked, in particular in the case of relevance feedback.
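
As a minimal sketch of the reformulation described above, the coordinates of a document $d$ with respect to a category $c_i$ could be written as follows. This assumes the usual logarithmic form of Bayes' rule, which the text above does not spell out; the coordinate names $x(d)$ and $y(d)$ and the complement notation $\bar{c}_i$ are introduced here only for illustration.

\[ x(d) = \log \mathrm{P}(d | c_i) + \log \mathrm{P}(c_i) \]
\[ y(d) = \log \mathrm{P}(d | \bar{c}_i) + \log \mathrm{P}(\bar{c}_i) \]

Under this assumption, $d$ would be assigned to $c_i$ when $x(d) > y(d)$, that is, when its point lies below the line $y = x$. The prior terms do not depend on $d$, so they translate every document by the same amount along each axis, which is the constant shift visible in the plot.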