Multi-label classification of computed tomography scan reports / Zampieri, Matteo. - (2019 Dec 20).

Multi-label classification of computed tomography scan reports

Zampieri, Matteo
2019-12-20

Abstract

The digitalization of clinical reports and the ever-growing use of electronic health records make it possible to collect huge amounts of data. These data can be used to explore strategies that aid both patients and clinical personnel, either as inference tools that could usefully inform diagnostic decisions or as a general research pool. This project specifically uses reports of Computed Tomography scans of patients with metastatic breast cancer. The aim of the thesis is to explore methods for multi-label text classification. Each report, which comes as a free-text description, is assigned a varying number of tags according to the locations of the metastases inferred from it. To address this problem, I used a set of algorithms, namely logistic regression (multinomial and one-vs-rest), k-Nearest-Neighbors (with 'uniform' and 'distance' weights), Multi-k-Nearest-Neighbors, and Support Vector Classifier; these algorithms were fed with different types of text representations (TF-IDF and doc2vec). Moreover, the fastText library was explored for its integrated word embedding and text classification capabilities. Finally, I used Fast-Bert, an open-source extension of Google's BERT designed specifically for text classification. The results were not satisfactory, owing to the small size and the high class imbalance of the dataset. However, the investigation of different techniques has shed light on the promising possibilities of some of the strategies used.
Heltai, Luca
Bortolussi, Luca
Files in this item:
File: Zampieri.pdf
Access: open access
Description: MHPC Thesis
Type: Thesis
License: Not specified
Size: 3.41 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11767/116065