The Role of Raters Threshold in Estimating Interrater Agreement

Nucci M.; Spoto A.; Altoe G.; Pastore M.
2021

Abstract

The evaluation of agreement among experts in a classification task is crucial in many situations (e.g., medical and psychological diagnosis, legal reports). Traditional indexes used to estimate interrater agreement (such as Cohen’s κ) simply count the number of observed agreements and correct them by removing chance agreements. In this article, we introduce a new theoretical framework for the evaluation of interrater agreement based on the possibility of adjusting the observed classifications conducted by the raters. This framework refers to the introduction and formalization of two concepts involved in the classification task: (a) the belonging measure of an object to a category and (b) the rater’s belonging threshold, which is the minimally sufficient value of the belonging measure at which the rater will classify an object into a category. These factors are ignored by traditional indexes for interrater agreement, though their role may be decisive. Two Bayesian models are tested through a Monte Carlo simulation study to evaluate the accuracy of the new methodology for estimating raters’ threshold and the actual degree of agreement between two independent raters. Results show that the computation of traditional indexes for interrater agreement on the adjusted classifications leads to a more accurate estimation of the experts’ actual agreement. This improvement is greater when a large difference between raters’ belonging thresholds is observed; when the difference is small, the proposed method provides similar results to those obtained in the simple observed classifications. Finally, an empirical application to the field of psychological assessment is presented to show how the method could be used in practice.
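To make the role of the belonging threshold concrete, the following is a minimal illustrative sketch, not the Bayesian models of the article. It assumes a hypothetical belonging measure on a 0 to 99 integer scale that both raters perceive identically, and hypothetical thresholds chosen only for illustration. Under these assumptions, any drop in Cohen's κ is due to the threshold difference alone.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) classifications."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                  # marginal rates of category 1
    p_chance = pa1 * pb1 + (1 - pa1) * (1 - pb1)       # agreement expected by chance
    if p_chance == 1.0:                                # degenerate case: no variability
        return 1.0
    return (p_obs - p_chance) / (1 - p_chance)


def classify(measures, threshold):
    """Assign category 1 when the belonging measure reaches the rater's threshold."""
    return [1 if m >= threshold else 0 for m in measures]


# Hypothetical belonging measures for 100 objects (assumption, not data from the article).
measures = list(range(100))

# Same threshold for both raters: identical classifications.
kappa_same = cohen_kappa(classify(measures, 50), classify(measures, 50))

# Different thresholds (30 vs. 70): the raters disagree on every object in between.
kappa_diff = cohen_kappa(classify(measures, 30), classify(measures, 70))

print(kappa_same, round(kappa_diff, 2))  # 1.0 and about 0.31
```

In this toy setting the two raters rank the objects identically, yet the observed κ falls from 1.0 to roughly 0.31 when their thresholds differ. This is the kind of discrepancy between observed and actual agreement that the abstract says traditional indexes ignore.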
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3410638