Predicting COVID-19 via CT Scans

May 6, 2021

Problem

Our main goal is to determine whether a patient has COVID-19 based on a set of CT scans. CT scans can be helpful when patients have developed severe conditions, and they give a more detailed view of the patient’s lungs than a regular swab test. Furthermore, swab testing is prone to giving false positives and often needs multiple tests to confirm whether a patient is virus-free. A correct judgement from a CT scan lets us visualize the patient’s condition better, while also giving doctors an update on the severity of the infection. Even though many strides have been made in research on the virus, it is still relatively new, and its effects vary from patient to patient.

Data

The dataset we are using is a binary set of 2482 CT-scan images: 1252 are from patients with COVID-19, and the other 1230 are from non-COVID patients. CT scans combine multiple X-rays taken from different angles to get a more holistic and accurate view than a traditional X-ray. This makes them a better candidate for image classification, since they give us more information in the same number of pixels. All of the images were taken from patients in Sao Paulo, Brazil. The dataset was created by a team of British and Brazilian medical and computing professionals. The original intent of the data was to “encourage the research and development of artificial intelligent methods which are able to identify if a person is infected by SARS-CoV-2 through the analysis of his/her CT scans.”

Dataset: https://www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset

FastAI Model

The image file names are read in, and each image’s class (Covid or Non-Covid) is determined from its file name using regular expressions with FastAI’s built-in functions. We created a validation set containing 20% of the data. We then normalized the image data based on the stats of the RGB channels when loaded.
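The labeling and splitting steps above can be sketched with the standard library alone. This is a minimal illustration, not the FastAI code we ran: the file naming scheme (`Covid (12).png`, `Non-Covid (7).png`), the helper names, and the random seed are all assumptions made for the example.

```python
import random
import re

# Assumed naming scheme: "Covid (12).png" / "Non-Covid (7).png".
LABEL_PATTERN = re.compile(r"^(Covid|Non-Covid) \(\d+\)\.png$")


def label_from_name(fname):
    """Return the class label encoded in a file name."""
    match = LABEL_PATTERN.match(fname)
    if match is None:
        raise ValueError(f"Unrecognized file name: {fname!r}")
    return match.group(1)


def train_valid_split(fnames, valid_pct=0.2, seed=42):
    """Shuffle file names and hold out valid_pct of them for validation."""
    shuffled = list(fnames)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical file list with the same class counts as the dataset.
fnames = [f"Covid ({i}).png" for i in range(1, 1253)]
fnames += [f"Non-Covid ({i}).png" for i in range(1, 1231)]
train, valid = train_valid_split(fnames)
print(len(train), len(valid))
```

With 2482 images, a 20% hold-out leaves 1986 images for training and 496 for validation.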

After processing the data, we used a pre-trained PyTorch ResNet-34 model.

ResNets (short for Residual Networks) are a family of convolutional neural networks used in computer vision. They are designed to mitigate the vanishing gradient problem and the degradation problem, where accuracy gets worse as a network gets deeper. An image comparison of ResNet vs other popular models is shown below.
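To make the idea of a residual connection concrete, here is a toy NumPy sketch. It is a simplification: real ResNet blocks use convolutions and batch normalization rather than the plain matrix multiplications assumed here, and the function names are ours.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def residual_block(x, w1, w2):
    """Two linear layers plus an identity shortcut: relu(F(x) + x)."""
    fx = relu(x @ w1) @ w2   # the residual function F(x)
    return relu(fx + x)      # the shortcut adds the input back in


x = np.array([[1.0, 2.0, 3.0]])
zeros = np.zeros((3, 3))
# With all-zero weights F(x) is zero, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))
```

Because the input is added back after the layers, a block only has to learn a correction to its input, and the shortcut gives gradients a direct path back through the network.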

After training the model, we achieved an accuracy of 89%. You can see our confusion matrix to the left. We were not satisfied with these results and needed to fine-tune.

We then found a suitable learning rate using FastAI’s built-in lr_find method. After selecting 1e-3 as our new learning rate, we retrained.
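A learning-rate finder of this kind trains briefly while increasing the learning rate on a log scale and records the loss at each step. The sketch below shows only the selection logic, under the assumption that we pick the rate where the loss is falling fastest; FastAI's own suggestion heuristics differ in detail, and the loss values here are made up for illustration.

```python
import math


def lr_schedule(start_lr, end_lr, num_steps):
    """Learning rates spaced evenly on a log scale from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


def steepest_descent_lr(lrs, losses):
    """Pick the learning rate where the loss falls fastest per unit of log10(lr)."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < best_slope:
            best_lr, best_slope = lrs[i], slope
    return best_lr


lrs = lr_schedule(1e-5, 1e-1, 5)           # roughly 1e-5, 1e-4, 1e-3, 1e-2, 1e-1
losses = [0.70, 0.68, 0.45, 0.40, 1.90]    # made-up losses, one per learning rate
print(steepest_descent_lr(lrs, losses))
```

In this toy run the loss drops most sharply on the way to 1e-3 and blows up at 1e-1, so the function returns a value close to 1e-3.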

After fine-tuning we got an accuracy of ~93%. This result is shown in the confusion matrix.
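For readers unfamiliar with how accuracy relates to a confusion matrix, here is a small self-contained sketch. The labels are toy values chosen for illustration, not our model's predictions, and the helper names are ours.

```python
def confusion_counts(y_true, y_pred, positive="Covid"):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Fraction of predictions on the diagonal of the confusion matrix."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# Toy labels for illustration only.
y_true = ["Covid", "Covid", "Covid", "Non-Covid", "Non-Covid", "Non-Covid"]
y_pred = ["Covid", "Covid", "Non-Covid", "Non-Covid", "Non-Covid", "Covid"]
counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(counts))
```

In a medical setting the off-diagonal cells matter individually: a false negative (a missed Covid case) is usually more costly than a false positive, which accuracy alone does not show.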

Keras Model

We also decided to try a Keras model, since we could only reach 93% accuracy with FastAI.

When looking at our data, we found no column for patient ID, and some of the pictures looked very similar. We suspected data leakage: the same patient could have multiple scans in the set, both with and without Covid. To address this, we used KMeans clustering to group photos with similar appearances, so that look-alike images could be kept together rather than spread across the training and validation sets. We also applied random transformations (rotating, shearing, zooming, and shifting) to make our model more robust to new images it may receive. For the Keras model itself, we went with DenseNet121.
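One way to use cluster assignments against leakage is to split by cluster instead of by image. The sketch below assumes each image already has a cluster id (for example, the labels produced by KMeans on image features); the function name, the ids, and the 20% target are hypothetical and not taken from our notebook.

```python
import random
from collections import defaultdict


def cluster_aware_split(cluster_ids, valid_pct=0.2, seed=0):
    """Split image indices so every cluster lands wholly in train or validation."""
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)

    target = valid_pct * len(cluster_ids)
    train, valid = [], []
    for cluster in clusters:
        # Fill validation with whole clusters until it reaches the target size.
        bucket = valid if len(valid) < target else train
        bucket.extend(members[cluster])
    return train, valid


# Hypothetical cluster ids for 10 images.
cluster_ids = [0, 0, 1, 2, 2, 2, 3, 4, 4, 1]
train, valid = cluster_aware_split(cluster_ids)
print(sorted(train), sorted(valid))
```

Because clusters are never divided, near-duplicate scans cannot appear on both sides of the split. The trade-off is that the validation set may end up slightly larger than the requested fraction.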

This convolutional neural network is built for image classification. We added standard layers before and after it. First comes an input layer, followed by a Conv2D layer, which learns a convolutional kernel that is convolved with the input. The result is fed into the DenseNet121 layer, whose output is pushed down two paths: average pooling, which smooths out the sharp edges of our images, and max pooling, which keeps the strongest responses (the brightest pixels in the image). These results are then combined and passed through two rounds of batch normalization, dropout, and dense layers, in which all neurons are connected.

After creating the model, we had to evaluate it on the validation set. To do this, we used threshold optimization.
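The two pooling paths and their combination can be illustrated in NumPy. This sketch assumes global pooling over the spatial dimensions and concatenation as the combining step; the tiny feature map stands in for the DenseNet121 output and is not real model data.

```python
import numpy as np


def pooled_features(feature_map):
    """Concatenate global average pooling and global max pooling.

    feature_map has shape (height, width, channels).
    """
    avg_pool = feature_map.mean(axis=(0, 1))  # one average per channel
    max_pool = feature_map.max(axis=(0, 1))   # one maximum per channel
    return np.concatenate([avg_pool, max_pool])


# Toy 2x2 feature map with 2 channels.
feature_map = np.array([[[1.0, 0.0], [3.0, 4.0]],
                        [[5.0, 8.0], [7.0, 4.0]]])
print(pooled_features(feature_map))
```

The output vector has twice as many entries as the feature map has channels: the averages summarize the overall response of each channel, while the maxima keep its single strongest response.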

Our model outputs a value between 0 and 1: the predicted probability that a patient has Covid. We wanted to find the best threshold for classification. In other words, if the prediction is greater than the threshold, the patient is classified as having Covid, and otherwise as not having it. We optimized the threshold by trying many different values and keeping the one that fit best. In the end, we arrived at a threshold of 0.7507417. With this threshold we obtained an AUC of 0.9755327105522156 and an accuracy of 0.9470588235294117.
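A threshold search of this kind can be sketched in a few lines. The example assumes accuracy is the metric being maximized and scans evenly spaced candidates; the probabilities and labels are toy values, and the function names are ours.

```python
def accuracy_at(threshold, probs, labels):
    """Accuracy when predicting Covid (1) for probabilities above threshold."""
    correct = sum((p > threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)


def best_threshold(probs, labels, num_steps=101):
    """Scan evenly spaced thresholds in [0, 1] and keep the most accurate one."""
    candidates = [i / (num_steps - 1) for i in range(num_steps)]
    return max(candidates, key=lambda t: accuracy_at(t, probs, labels))


# Toy probabilities and labels for illustration only (1 = Covid, 0 = Non-Covid).
probs = [0.10, 0.35, 0.60, 0.72, 0.80, 0.95]
labels = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(probs, labels)
print(threshold, accuracy_at(threshold, probs, labels))
```

Note that a threshold tuned on the validation set is itself fitted to that data, so the accuracy measured at that threshold is likely to be somewhat optimistic compared with a separate held-out test set.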

Conclusion

We were able to predict whether a patient had Covid or not with close to 95% accuracy. This is a strong result and suggests that image classification is a viable method for diagnosing COVID-19. While it seems like the pandemic is coming to an end, and more and more people have access to COVID tests, the lessons learned here could play a pivotal role in the future. We can use images to diagnose infections and diseases when traditional testing is not available or takes too long.

Next Steps

Some possible next steps for this project are model tuning, model exploration, data collection, and expanding classes.

- Model Tuning: Further tuning of the hyperparameters within each model should yield better results, as seen when tweaking FastAI’s learning rate.

- Model Exploration: While only two models were researched and implemented in this project, it might be worthwhile to explore other common image classification models like VGG-16 and ResNet50 (50 layers instead of 34), among others.

- Data Collection: A model can only perform as well as the data it is trained on. As such, larger datasets would prove valuable in increasing the accuracy of the model.

- Expanding Classes: Adding classes for other lung diseases, like pneumonia, would make the model more practical. A patient without Covid who is not feeling sick would not necessarily be getting a CT scan in the first place, so differentiating between lung diseases would be a more realistic use of this project down the road.

Contributors

Anthony Bazzini, Kushal Shah, Anirudh Anasuri, Hunter Pischke, Dillon Gentry, Paras Sharma