A new approach to the Text Categorization problem is here presented. It is called Gaussian Weighting and it is a supervised learning algorithm that, during the training phase, estimates two very simple and easily computable statistics which are: the Presence \emph{P}, how much a term \emph{t} is present in a category \emph{c}; the Expressiveness \emph{E}, how much \emph{t} is present outside \emph{c} in the rest of the domain. Once the system has learned this information, a Gaussian function is shaped for each term of a category, in order to assign the term a weight that estimates the level of its importance for that particular category. We tested our learning method on the task of single-label classification using the Reuters-21578 benchmark. The outcome of the result was quite impressive: in different experimental setups, we reached a micro-averaged F1-measure of 0.89, with a peak of 0.899. Moreover, a macro-averaged Recall and Precision was calculated: the former reported a 0.72, the latter a 0.79. These results reach most of the state-of-the-art techniques of machine learning applied to Text Categorization, demonstrating that this new weighting scheme does perform well on this particular task.
Does a New Simple Gaussian Weighting Approach Perform Well in Text Categorization?
DI NUNZIO, GIORGIO MARIA;
2003
Abstract
A new approach to the Text Categorization problem is here presented. It is called Gaussian Weighting and it is a supervised learning algorithm that, during the training phase, estimates two very simple and easily computable statistics which are: the Presence \emph{P}, how much a term \emph{t} is present in a category \emph{c}; the Expressiveness \emph{E}, how much \emph{t} is present outside \emph{c} in the rest of the domain. Once the system has learned this information, a Gaussian function is shaped for each term of a category, in order to assign the term a weight that estimates the level of its importance for that particular category. We tested our learning method on the task of single-label classification using the Reuters-21578 benchmark. The outcome of the result was quite impressive: in different experimental setups, we reached a micro-averaged F1-measure of 0.89, with a peak of 0.899. Moreover, a macro-averaged Recall and Precision was calculated: the former reported a 0.72, the latter a 0.79. These results reach most of the state-of-the-art techniques of machine learning applied to Text Categorization, demonstrating that this new weighting scheme does perform well on this particular task.Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.