Using Scatterplots to Understand and Improve Probabilistic Models for Text Categorization and Retrieval

DI NUNZIO, GIORGIO MARIA
2009

Abstract

The two-dimensional representation of documents, which places each document on a two-dimensional Cartesian plane, has proved to be a valid visualization tool for Automated Text Categorization (ATC): it helps to understand the relationships between categories of textual documents, and it allows users to visually audit the classifier and identify suspicious training data. In this paper, we analyze a specific use of this visualization approach in the case of the Naive Bayes (NB) model for text classification and the Binary Independence Model (BIM) for text retrieval. For text categorization, the equation for the classification decision is reformulated so that each coordinate of a document is the sum of two addends: a variable component $\mathrm{P}(d | c_i)$, and a constant component $\mathrm{P}(c_i)$, the prior of the category. When documents are plotted on the Cartesian plane according to this formulation, they can be seen to be shifted by a constant amount along the x-axis and the y-axis. This shift is more or less evident depending on which NB model, Bernoulli or multinomial, is chosen. For text retrieval, the same reformulation can be applied to the BIM. The visualization helps to understand the decisions that are taken to rank the documents, in particular in the case of relevance feedback.
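The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a multinomial NB model, a binary split between a target category and its complement, log-space probabilities (so that the likelihood and the prior become two addends), and add-one smoothing. The function names `train_multinomial_nb` and `coordinates` are hypothetical.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, target):
    """Estimate log-priors and smoothed log term probabilities for the
    target category (key True) and for its complement (key False).

    Assumptions of this sketch: add-one (Laplace) smoothing, and at least
    one training document on each side of the split.
    """
    vocab = {term for tokens in docs for term in tokens}
    term_counts = {True: Counter(), False: Counter()}
    doc_counts = {True: 0, False: 0}
    for tokens, label in zip(docs, labels):
        key = label == target
        term_counts[key].update(tokens)
        doc_counts[key] += 1

    model = {}
    for key in (True, False):
        total = sum(term_counts[key].values())
        log_prior = math.log(doc_counts[key] / len(docs))
        log_term = {
            term: math.log((term_counts[key][term] + 1) / (total + len(vocab)))
            for term in vocab
        }
        model[key] = (log_prior, log_term)
    return model


def coordinates(tokens, model):
    """Return (x, y) for one document.

    Each coordinate is a variable part (the log-likelihood of the document)
    plus a constant part (the log-prior of the category or its complement).
    Terms outside the training vocabulary are ignored.
    """
    point = []
    for key in (True, False):
        log_prior, log_term = model[key]
        variable = sum(log_term[t] for t in tokens if t in log_term)
        point.append(variable + log_prior)
    return point[0], point[1]


# Toy data, invented for illustration only.
docs = [["goal", "match"], ["goal", "team"], ["vote", "law"], ["law", "court"]]
labels = ["sport", "sport", "politics", "politics"]
model = train_multinomial_nb(docs, labels, target="sport")
x, y = coordinates(["goal", "team"], model)
```

Under these assumptions a document is assigned to the target category when `x > y`, so the decision boundary is the line `y = x`. Because the prior enters each coordinate as a constant, changing it moves every document by the same amount along that axis, which is the shift the abstract refers to.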

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/2377499