

Imagine some engineers come to a hospital one day and offer the use of their algorithm to detect tumors. This algorithm, they say, will distinguish benign from malignant tumors with 99% accuracy. The chief physician agrees to try it out because it will save time for his doctors and allow for faster detection of disease. However, it turns out that this system predicts a benign tumor regardless of the input parameters. It reaches 99% accuracy only because 99% of all tumors found are benign anyway.

The system does not actually work, and nobody noticed because it was tested incorrectly. If the situation took place in real life, many patients could suffer from misdiagnosis. The flaw stayed hidden because, for example, the Accuracy metric was used to evaluate the model, and that metric is not suitable for data this imbalanced.

You should always check how a model generalizes to unseen data. For example, in an enterprise setting, models should deliver real value to the business by performing at the highest level on new inputs. In this article, we will explain common evaluation metrics for classification and regression ML problems.

## Why evaluate the performance of an ML model?

Tasks for ML projects are usually divided into two categories: classification and regression. Classic examples are the detection of offensive content (a classification task) and the prediction of a stock price at a given point in time (a regression task). In both cases, the solution is to provide a trained ML model. But the model cannot be considered complete without evaluating its performance.

The right choice of evaluation metric is crucial for a machine learning (ML) project. The choice of the actual metric depends on the ML task you are facing. Companies deploy ML models expecting them to give precise and reliable predictions for relevant use cases. If the wrong metric is used, results will be unreliable and potentially damaging. Therefore, it is necessary to accurately assess how ML models generalize to test data.

The data that engineers must process to train models is constantly changing. With some ML project management approaches, the trained model may not perform well in the long run because it does not automatically adapt to changes in the datasets.

If the data is updated while the model stays locked, the accuracy of the model drops. *Accuracy* is a determining factor when evaluating the quality of an ML project, especially in healthcare. The higher the accuracy, the better the model performs. However, this is only true for balanced validation datasets. When dealing with imbalanced validation datasets, metrics such as *Precision*, *Recall*, and *F1 score* will be required. We will look at these and a few more metrics below.

## Classification metrics

Predictions for classification problems produce four types of results: true positives, true negatives, false positives, and false negatives. Let’s look at a few metrics for classification problems.

### Confusion matrix

The confusion matrix is a table that counts model predictions against the actual class labels of the data points. It forms the basis for other types of classification metrics. This matrix fully describes the performance of the model. It also gives a detailed breakdown of correct and incorrect classifications for each class.

The predictions in the matrix are divided into four groups. In the diagram, we see the combinations of the actual known answer and the predicted answer:

- Correct positive predictions — True positives. A scenario where positive predictions are indeed positive.
- Incorrect positive predictions — False positives. Positive predictions are actually negative.
- Correct negative predictions — True negatives. Negative predictions are indeed negative.
- Incorrect negative predictions — False negatives. A scenario where negative predictions are actually positive.

Typical metrics for classification problems are *Accuracy*, *Precision*, *Recall*, *False Positive Rate*, and *Specificity*, and these are derived from the confusion matrix. Each metric measures a different aspect of the predictive model.

Let’s take an example to better understand the confusion matrix. Let’s say we create a binary classifier to separate images of unicorns from images of ordinary horses. Let’s assume our test set contains 1100 images (1000 non-unicorn images and 100 unicorn images) with the confusion matrix below.

Out of 100 unicorn images, the model correctly predicted 90 of them and misclassified 10 of them. If we treat the “unicorn” class as positive and the non-unicorn class as a negative class, then 90 samples predicted as unicorns are considered true positives, and 10 samples predicted as non-unicorns are false negatives.

Out of 1000 non-unicorn images, the model correctly classified 940 and misclassified 60 of them. 940 correctly classified samples are called true negatives, and 60 are called false positives.

As we can see, the diagonal elements of this matrix indicate the correct prediction for different classes, and the off-diagonal elements indicate misclassified samples.
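The counts from the unicorn example can be tabulated with a few lines of Python. This is only a minimal sketch: the `confusion_matrix` helper, the `"horse"` label for non-unicorn images, and the row/column layout (rows are actual classes, columns are predicted classes) are assumptions made for illustration.

```python
# Labels reconstructed from the counts in the unicorn example.
actual = ["unicorn"] * 100 + ["horse"] * 1000
predicted = (
    ["unicorn"] * 90 + ["horse"] * 10      # the 100 real unicorns
    + ["horse"] * 940 + ["unicorn"] * 60   # the 1000 real horses
)

def confusion_matrix(actual, predicted, positive="unicorn"):
    """Return (tp, fn, fp, tn) counts for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_matrix(actual, predicted)

# Assumed layout: rows are actual classes, columns are predicted classes.
print("                pred unicorn  pred horse")
print(f"actual unicorn  {tp:>12}  {fn:>10}")
print(f"actual horse    {fp:>12}  {tn:>10}")
```

The diagonal cells (90 and 940) hold the correct predictions, and the off-diagonal cells (10 and 60) hold the misclassified samples.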

### Accuracy

Accuracy is the most commonly used metric for model evaluation. It is the ratio of correct predictions (true positives and true negatives) to all predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN). This means that we can sum the diagonal of the matrix and divide it by the sum of all four cells. However, it is not always a reliable measure of performance. For example, ML algorithms help doctors detect cancer at an early stage. Let’s assume that out of 100 patients, 90 do not have cancer and the remaining 10 do.

Medical professionals cannot afford to miss a patient who has cancer (a false negative). Yet a model that labels all 100 patients as cancer-free still reaches an accuracy of 90%. Such a model did nothing but predict "no cancer" 100 times. So, relying on this metric alone is not a good way to evaluate this model.
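The 100-patient example is easy to reproduce. The sketch below assumes a hypothetical "model" that ignores its input and always predicts "no cancer"; the 0/1 label encoding is also an assumption.

```python
# 100 patients from the example: 90 healthy (0) and 10 with cancer (1).
actual = [0] * 90 + [1] * 10

# A hypothetical "model" that always answers "no cancer".
predicted = [0] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# How many of the 10 real cancer cases did the model find?
found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(f"accuracy = {accuracy:.2f}")           # 0.90
print(f"cancer cases found = {found} of 10")  # 0 of 10
```

An accuracy of 0.90 looks respectable, yet every one of the ten cancer cases is a false negative.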

### Precision

Precision is the fraction of truly positive instances among all instances the model predicted as positive: Precision = TP / (TP + FP). The denominator is the total number of positive predictions the model made over the entire data set. It shows how often the model is right when it says an instance is positive.

Let’s take the field of preventive maintenance as an example. It requires precise predictions of when a car will need to be repaired. The cost of maintenance is usually high, so incorrect forecasts can result in losses for the company. In such cases, the ability of the model to correctly classify the positive class and reduce false positives is paramount.
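As a quick check, precision can be computed from the unicorn example (90 true positives, 60 false positives). The `precision` helper is a hypothetical name, and returning 0.0 when the model makes no positive predictions is an assumed convention.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    # Assumed convention: 0.0 when the model predicts no positives at all.
    return tp / (tp + fp) if (tp + fp) else 0.0

# Unicorn example: 90 true positives, 60 false positives.
unicorn_precision = precision(tp=90, fp=60)
print(f"precision = {unicorn_precision:.2f}")  # 0.60
```

Only 60% of the images flagged as unicorns really are unicorns, even though the same model finds 90 of the 100 real ones.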

### Recall/Sensitivity/True Positive Rate

Recall measures how many actual positives were predicted as positive: Recall = TP / (TP + FN). It is the probability that an actual positive case is predicted correctly. It is also known as Sensitivity or True Positive Rate (TPR).

In other words, it shows how many actual positives the model missed alongside the ones it found. This metric can be useful in the fight against fraud. A high recall would indicate that many fraudulent activities were detected out of the total number of fraud cases.
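For the unicorn example (90 true positives, 10 false negatives), recall can be sketched as follows. The `recall` helper and its zero-denominator convention are assumptions for illustration.

```python
def recall(tp, fn):
    """Fraction of actual positives that the model predicted as positive."""
    # Assumed convention: 0.0 when the data contains no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Unicorn example: 90 unicorns found, 10 unicorns missed.
unicorn_recall = recall(tp=90, fn=10)
print(f"recall = {unicorn_recall:.2f}")  # 0.90
```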

### Specificity

Specificity is the fraction of actual negative instances that the model correctly predicted as negative: Specificity = TN / (TN + FP). The denominator is the number of actual negative instances present in the dataset.

This is similar to recall, but the focus shifts to negative instances. For example, it tells us how many patients who did not have cancer were correctly told that they did not have cancer.
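With the unicorn counts (940 true negatives, 60 false positives), specificity and the related False Positive Rate can be sketched like this. The helper names are hypothetical, and FPR = 1 - Specificity follows directly from the two definitions.

```python
def specificity(tn, fp):
    """Fraction of actual negatives that were predicted as negative."""
    return tn / (tn + fp)

def false_positive_rate(tn, fp):
    """Fraction of actual negatives that were wrongly predicted as positive."""
    return fp / (tn + fp)

# Unicorn example: 940 horses recognized as horses, 60 mistaken for unicorns.
spec = specificity(tn=940, fp=60)
fpr = false_positive_rate(tn=940, fp=60)
print(f"specificity = {spec:.2f}")  # 0.94
print(f"FPR = {fpr:.2f}")           # 0.06
```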

### F1 score

This metric combines precision and recall into a single number, their harmonic mean: F1 = 2 × Precision × Recall / (Precision + Recall). The higher the F1 score, the better. A high F1 score tells us that the model is both precise and thorough in finding positives.

If the product in the numerator of the formula becomes low, the final F1 score drops significantly. Thus, the model scores well on F1 only if the values it predicts as positive are actually positive (precision) and it does not miss positive instances by predicting them as negative (recall).
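The sketch below computes F1 as the harmonic mean of precision and recall. The first call uses the unicorn example (precision 0.6, recall 0.9); the second uses made-up values to show how one weak component pulls the score down. The `f1_score` helper is a hypothetical name.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both components are zero
    return 2 * precision * recall / (precision + recall)

# Unicorn example: precision 0.6, recall 0.9.
print(f"{f1_score(0.6, 0.9):.2f}")  # 0.72

# Made-up values: recall is still high, but precision is poor.
print(f"{f1_score(0.1, 0.9):.2f}")  # 0.18, far below the arithmetic mean of 0.50
```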

### PR and ROC curves

One of the disadvantages of the F1 score is that precision and recall are given the same weight. Depending on the application, we may need one to be higher than the other, and the F1 score may not be the right metric for that. Therefore, looking at the PR or ROC (receiver operating characteristic) curves can help.

*PR curve*

A PR curve plots precision against recall for various thresholds. The diagram below shows six predictors with their respective precision-recall curves for various thresholds. The upper right part of the graph is the space where we get high precision and recall. The predictor and threshold value are selected manually. PR AUC is the area under the curve; the higher its numerical value, the better. Research by Jonathan Cook and Vikram Ramadas showed that PR AUC can be used to predict corporate fraud.
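To see how the threshold moves a model along its PR curve, here is a minimal sketch. The labels and scores are invented for illustration, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
# Hypothetical true labels and model scores, invented for illustration.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.40, 0.35, 0.10]

def precision_recall_at(threshold, labels, scores):
    """Precision and recall when scores >= threshold count as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.9, 0.6, 0.3):
    p, r = precision_recall_at(t, labels, scores)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.9  precision=1.00  recall=0.25
# threshold=0.6  precision=0.75  recall=0.75
# threshold=0.3  precision=0.57  recall=1.00
```

Lowering the threshold raises recall and, in this data, lowers precision. Each threshold gives one point on the PR curve.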

*ROC curve*

ROC stands for receiver operating characteristic. The curve plots TPR against FPR for various thresholds. As TPR increases, FPR also increases.

As you can see in the first picture, we have four categories, and we need a threshold that brings us closer to the top left corner. Comparing three predictors on a given dataset also becomes easy. You can select a threshold according to your application. ROC AUC is the area under the curve; the higher its numerical value, the better.

The ROC curve is often used in assessing the clinical performance of a biochemical test. It graphically shows the relationship/tradeoff between clinical sensitivity and specificity for each possible cutoff for a test or combination of tests. The area under the ROC curve gives an indication of the benefit of using the test.
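The following sketch builds ROC points by sweeping the threshold from high to low and estimates ROC AUC with the trapezoidal rule. The data is invented, `roc_points` and `auc` are hypothetical helpers, and the code assumes all scores are distinct (ties would need extra handling).

```python
# Hypothetical true labels and model scores, invented for illustration.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.40, 0.35, 0.10]

def roc_points(labels, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold past a score adds that sample to the predicted positives.
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

points = roc_points(labels, scores)
roc_auc = auc(points)
print(f"ROC AUC = {roc_auc:.2f}")  # 0.75
```

A predictor that ranks every positive above every negative would reach an AUC of 1.0, while random scoring would hover around 0.5.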

## Regression metrics

The confusion matrix, the F1 score, and the PR and ROC curves are all metrics for classification problems. They work with discrete data and are ideal for classification tasks since they are concerned with whether a prediction is correct. Regression problems are different: they work with continuous data, where predictions fall in a continuous range. Let’s examine the metrics used for regression.

### Root mean square error

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far data points are from the regression line. RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
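A minimal sketch of RMSE, computed as the square root of the mean squared residual. The sample values and the `rmse` helper are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean of the squared residuals."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Invented values: residuals are 0.5, 0.0, -1.5 and -1.0.
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

error = rmse(actual, predicted)
print(f"RMSE = {error:.3f}")  # 0.935
```

The result is expressed in the same units as the target variable, which makes it easier to interpret than a squared quantity.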

### Mean absolute error

The average value of the absolute difference between the original and predicted values is called the mean absolute error. This metric gives an estimate of how far the predictions were from the actual result.
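As an illustration, the mean absolute error can be computed as below. The sample values and the `mae` helper are assumptions for this sketch.

```python
def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Invented values: absolute errors are 0.5, 0.0, 1.5 and 1.0.
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

error = mae(actual, predicted)
print(f"MAE = {error}")  # 0.75
```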

### Mean squared error

Mean squared error (MSE) is the mean of the squared differences between the original and predicted values. RMSE is its square root.
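Because the differences are squared, a single large error dominates the result. The sketch below uses invented values to compare a set of close predictions with one that contains a single prediction that is off by 10. The `mse` helper is a hypothetical name.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 2.5, 7.0]
close = [2.5, 5.0, 4.0, 8.0]         # every prediction is near the target
one_outlier = [2.5, 5.0, 4.0, 17.0]  # the last prediction is off by 10

mse_close = mse(actual, close)
mse_outlier = mse(actual, one_outlier)
print(mse_close)    # 0.875
print(mse_outlier)  # 25.625
```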

## Summing up

In an ideal scenario, the estimated performance of a model tells how well it performs on real-world data. Before choosing a metric, it is important to understand the context because each machine learning model tries to solve a problem with a specific goal using a specific data set.

The metrics mentioned in our article are just a few of the vast array of metrics for evaluating model performance. Often, one iteration may not be enough to get the model you want, and you may need to improve the model to get even better predictions. At Postindustria we offer a wide range of ML services and will help you develop a machine learning model that suits your needs.
