Staging cancer through text mining of pathology records

Belloni Pietro; Boccuzzo Giovanna; Guzzinati Stefano; Rossi Carlo R.; Rugge Massimo; Zorzi Manuel
2021

Abstract

Healthcare record systems store valuable information, and over 40% of it is estimated to be unstructured, in the form of free clinical text. The Veneto Cancer Registry provided a collection of pathology records that refer to cases of melanoma and contain free text, in particular the diagnosis. The aim of this research is to extract from the free text the size of the primary tumour, the involvement of lymph nodes, the presence of metastasis, and the cancer stage. This goal is pursued with text mining techniques based on a supervised statistical approach. Since extracting information from free text can be framed as a statistical classification problem, we apply several machine learning models to extract these variables from the text. A gold standard for these variables is available: an expert has already assessed the clinical records case by case. The best performing of the estimated models is gradient boosting. Despite this good performance, the classification error is not low enough for a text mining procedure of this kind to be used in a Cancer Registry as proposed.
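To make the framing of information extraction as supervised classification concrete, the following is a minimal sketch, not the chapter's actual pipeline. It assumes a binary bag-of-words representation and a hand-written gradient boosting of one-word regression stumps on the log-loss. The six example reports, their labels (1 = lymph node involvement), and every function name and parameter value are invented for illustration and are not taken from the Veneto Cancer Registry data.

```python
import math
import re

def tokenize(text):
    """Lowercase a report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(texts):
    """Map every distinct training token to a column index (sorted order)."""
    tokens = sorted({tok for t in texts for tok in tokenize(t)})
    return {tok: j for j, tok in enumerate(tokens)}

def vectorize(text, vocab):
    """Binary bag-of-words vector; words outside the vocabulary are ignored."""
    present = set(tokenize(text))
    return [1 if tok in present else 0 for tok in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Least-squares regression stump: split on presence of a single word."""
    best = None
    for j in range(len(X[0])):
        on = [r for x, r in zip(X, residuals) if x[j] == 1]
        off = [r for x, r in zip(X, residuals) if x[j] == 0]
        if not on or not off:
            continue  # this word does not split the sample
        m_on, m_off = sum(on) / len(on), sum(off) / len(off)
        sse = sum((r - m_on) ** 2 for r in on) + sum((r - m_off) ** 2 for r in off)
        if best is None or sse < best[0]:
            best = (sse, j, m_on, m_off)
    if best is None:
        raise ValueError("no word splits the training sample")
    return best[1:]

def fit_boosting(X, y, n_rounds=50, learning_rate=0.3):
    """Gradient boosting on the log-loss with shrinkage."""
    p = sum(y) / len(y)
    base = math.log(p / (1.0 - p))  # initial score: log-odds of the positive class
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log-loss with respect to the current score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, v_on, v_off = fit_stump(X, residuals)
        stumps.append((j, v_on, v_off))
        scores = [s + learning_rate * (v_on if x[j] else v_off)
                  for x, s in zip(X, scores)]
    return base, learning_rate, stumps

def predict_proba(x, model):
    """Estimated probability of the positive class for one feature vector."""
    base, learning_rate, stumps = model
    score = base + learning_rate * sum(v_on if x[j] else v_off
                                       for j, v_on, v_off in stumps)
    return sigmoid(score)

# Invented toy reports; label 1 means lymph node involvement.
texts = [
    "melanoma with metastasis in two sentinel lymph nodes",
    "nodular melanoma, lymph node metastasis present",
    "metastasis of melanoma in one axillary lymph node",
    "superficial spreading melanoma, sentinel lymph node negative",
    "melanoma in situ, margins free of tumour",
    "thin melanoma, lymph nodes free of tumour",
]
labels = [1, 1, 1, 0, 0, 0]

vocab = build_vocabulary(texts)
X = [vectorize(t, vocab) for t in texts]
model = fit_boosting(X, labels)

for report in ["metastasis in sentinel lymph node",
               "melanoma, lymph node free of tumour"]:
    print(report, "->", round(predict_proba(vectorize(report, vocab), model), 3))
```

In this toy sample the boosting rounds rely almost entirely on the single word "metastasis", so a negated phrase such as "no metastasis" would be misclassified. This is one simple way to see how a bag-of-words classifier can leave a residual classification error on real clinical text.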
Data Science and Social Research – II. Methods, Technologies and Applications
978-3-030-51221-7
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3353104