Understand Recall, Precision & ROC of your classifier, intuitively

If you work on Machine Learning projects, you will run into terms such as recall, precision and ROC quite frequently.

In this post, I will explain them intuitively, and draw the ROC curve step by step in Python.


Accuracy = (TP + TN)/(TP + TN + FP + FN)

Accuracy alone is a bad measure for classification tasks. A predictive model may have high accuracy, but be useless.

For example, suppose a fraud-detection classifier is evaluated on 100 transactions, 99 of which are benign and 1 of which is fraudulent.

Now suppose the classifier is a one-liner that simply returns False for any input:

def isFraud(transaction):
    return False

You will still get 99% accuracy, even though the classifier never catches a single fraudulent transaction. This is called the accuracy paradox.
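To make the paradox concrete, here is a minimal sketch that scores the always-False classifier on the 100-transaction example. The helper name `is_fraud` and the way the labels are built are illustrative assumptions, not part of any library.

```python
# 99 benign transactions (False) and 1 fraudulent one (True), as in the example.
labels = [False] * 99 + [True]

def is_fraud(transaction):
    """Hypothetical stand-in for the one-line classifier: never flags anything."""
    return False

predictions = [is_fraud(t) for t in range(len(labels))]

# Count the four cells of the confusion matrix.
tp = sum(p and y for p, y in zip(predictions, labels))
tn = sum((not p) and (not y) for p, y in zip(predictions, labels))
fp = sum(p and (not y) for p, y in zip(predictions, labels))
fn = sum((not p) and y for p, y in zip(predictions, labels))

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

The accuracy looks excellent, while the recall shows that the one fraudulent transaction was missed.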

Recall and Precision

Recall is formally defined as:

Recall = True positive / (True positive + False negative)

To remember what recall and precision mean without thinking about the true positive/false positive/false negative jargon, I conceptualize them as follows:

Say you are asked:

“Can you list the most recent 10 countries you have visited?”

So you recall.

The Recall rate is the ratio of the number of countries you correctly recall to the number of all correct answers (here, the 10 countries you actually visited).

If you can recall all 10 countries, you have a 1.0 recall rate (100%). If you can recall 7 countries correctly, you have a 0.7 recall rate.

However, you might be wrong in some answers.

For example, suppose you name 15 countries, of which 10 are correct and 5 are wrong. You still have a 1.0 recall rate, but your answer is not very precise.

This is where Precision comes into the picture.

Precision = True positive / (True positive + False positive)

Precision is the ratio of the number of countries you correctly recalled to the number of all countries you named (a mix of right and wrong answers). In the example above, that is 10/15, or about 0.67.
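The countries example can be sketched with Python sets. The country names below are made-up placeholders. Only the counts (10 visited, 15 answered, 5 of them wrong) come from the example.

```python
# Hypothetical data: the 10 countries actually visited, and the 15 that were answered.
visited = {f"country_{i}" for i in range(10)}           # ground truth
answered = visited | {f"wrong_{i}" for i in range(5)}   # 10 right + 5 wrong

true_positive = len(answered & visited)    # correctly recalled
false_positive = len(answered - visited)   # named but never visited
false_negative = len(visited - answered)   # visited but forgotten

recall = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)

print(recall)               # 1.0
print(round(precision, 2))  # 0.67
```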

So it is not hard to imagine that when you push up the recall rate of your model, its precision rate will usually drop, and vice versa.


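We now know that a model could be good at recall, but that does not mean it is also good at precision.

So the F1-score is introduced to evaluate the model with a single number. It is the harmonic mean of the precision and recall:

F1 = 2 * Precision * Recall / (Precision + Recall)

The drawback of the F1-score is that it treats precision and recall symmetrically. If one model has high recall and low precision, and another has low recall and high precision, the two F1-scores can end up very close to each other, so the score alone does not tell you which kind of error each model makes.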
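A small sketch of that drawback, using two hypothetical models whose precision and recall values are simply swapped. The numbers 0.3 and 0.9 are made up for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two hypothetical models with opposite strengths.
model_a = f1(precision=0.3, recall=0.9)  # casts a wide net, many false alarms
model_b = f1(precision=0.9, recall=0.3)  # very careful, misses most positives

print(round(model_a, 2), round(model_b, 2))  # 0.45 0.45
```

The two models behave very differently in practice, yet F1 cannot tell them apart.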

ROC curve & AUC

Now you have some idea about the Accuracy, the Recall and the Precision of a model.

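The ROC (Receiver Operating Characteristic) curve evaluates a model by plotting two such rates against each other as the decision threshold varies: the True Positive Rate (TPR, which is the same as Recall) on the y-axis and the False Positive Rate (FPR) on the x-axis, where

FPR = False positive / (False positive + True negative)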

Let’s take a look at a concrete example where I draw ROC curves from scratch.

For example, suppose you have already trained a classifier. By feeding your classifier 20 labelled samples, you get a list of scores from the classifier, like below:

# 0.9 means the classifier is 90% sure that this sample is True. 
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505, 0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]

Their labels are below, where 1 represents the False label and 2 represents the True label.

y_true =  [2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1]

Now you have all the ingredients ready to compute the ROC curve. Behind the scenes, sklearn's roc_curve() function first sorts the score list in descending order, as in this line from its source code:

# sort scores and corresponding truth values
desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]

Then for each score, the function will treat the specific score as the threshold, and apply it to all samples.

In our example, if you take the score 0.6 as the threshold, then the first 4 samples (namely the samples scored 0.9, 0.8, 0.7, and 0.6) will be classified as Positive, and the remaining 16 samples (whose scores are less than 0.6) are classified as Negative. Of those 4 predicted positives, 3 are truly positive and 1 is not, and there are 10 positive and 10 negative samples in total, so TPR = 3/10 = 0.3 and FPR = 1/10 = 0.1. That gives us one point on the plot.
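The single-threshold step can be checked with a few lines of plain Python. This is only a sketch of the idea, not how sklearn implements it. The constant `POSITIVE` is a name introduced here for readability.

```python
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
           0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]
y_true = [2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1]

POSITIVE = 2      # 2 is the True label, 1 is the False label
threshold = 0.6

# A sample is predicted positive when its score reaches the threshold.
predicted = [s >= threshold for s in y_score]

tp = sum(p and y == POSITIVE for p, y in zip(predicted, y_true))
fp = sum(p and y != POSITIVE for p, y in zip(predicted, y_true))
fn = sum((not p) and y == POSITIVE for p, y in zip(predicted, y_true))
tn = sum((not p) and y != POSITIVE for p, y in zip(predicted, y_true))

tpr = tp / (tp + fn)  # True Positive Rate (recall)
fpr = fp / (fp + tn)  # False Positive Rate

print(tp, fp, fn, tn)  # 3 1 7 9
print(tpr, fpr)        # 0.3 0.1
```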

By repeating this for each score, we get 20 points on the plot. Connecting them in order, starting from (0, 0) and ending at (1, 1), gives the ROC curve.
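Here is a from-scratch sketch of the whole procedure in plain Python, assuming every distinct score is tried as a threshold. The function names `roc_points` and `trapezoid_area` are made up for this post. sklearn's own implementation is vectorized and drops some intermediate points, so its arrays are shorter.

```python
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
           0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]
y_true = [2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1]

def roc_points(y_true, y_score, pos_label):
    """Return (fpr, tpr) lists: the (0, 0) start plus one point per distinct score."""
    n_pos = sum(y == pos_label for y in y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    # Walk the thresholds from the highest score down to the lowest.
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(s >= threshold and y == pos_label for s, y in zip(y_score, y_true))
        fp = sum(s >= threshold and y != pos_label for s, y in zip(y_score, y_true))
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x)))

fpr, tpr = roc_points(y_true, y_score, pos_label=2)

print(len(fpr))                            # 21
print(fpr[-1], tpr[-1])                    # 1.0 1.0
print(round(trapezoid_area(fpr, tpr), 2))  # 0.68
```

Note that the lowest threshold (0.1) already classifies every sample as positive, so the last computed point is (1, 1).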

Compute the ROC and AUC by sklearn in Python

# 1 represents False, 2 represents True
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true =  [2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505, 0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]

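# Note: newer scikit-learn releases also prepend the (0, 0) point,
# so the arrays you get may start with an extra 0.
fpr, tpr, _ = roc_curve(y_true, y_score, pos_label=2)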
>>> fpr
array([ 0. ,  0. ,  0.1,  0.1,  0.3,  0.3,  0.4,  0.4,  0.5,  0.5,  0.8, 0.8,  0.9,  0.9,  1. ])
>>> tpr
array([ 0.1,  0.2,  0.2,  0.5,  0.5,  0.6,  0.6,  0.7,  0.7,  0.8,  0.8, 0.9,  0.9,  1. ,  1. ])
>>> roc_auc = auc(fpr, tpr)
>>> round(roc_auc, 2)
0.68

Plot the ROC

plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

And now you have the ROC chart of this specific classifier. The AUC (Area Under the Curve) of this classifier is 0.68.

[Figure: ROC curve for classifier #1]

Is the above classifier good? It is hard to draw a conclusion from a single number, but in general the greater your AUC is, the better your classifier performs overall.

Let’s say that you have modified your classifier, for example, by adding or pruning some of your features.

Your new y_score is like below,

y_score = [0.9, 0.7, 0.7, 0.5, 0.55, 0.44, 0.53, 0.42, 0.51, 0.405, 0.4, 0.29, 0.38, 0.27, 0.36, 0.25, 0.34, 0.23, 0.3, 0.001] 

By repeating the ROC algorithm described above, we will get a new ROC curve with an AUC improved to 0.74.
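As a cross-check that does not need sklearn, the AUC can also be computed as the fraction of (positive, negative) pairs in which the positive sample gets the higher score, counting ties as half. This pairwise formulation is a different route from the trapezoid rule used by auc(), but it yields the same area. The names `pairwise_auc`, `old_score` and `new_score` are introduced here for illustration.

```python
y_true = [2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1]
old_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
             0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]
new_score = [0.9, 0.7, 0.7, 0.5, 0.55, 0.44, 0.53, 0.42, 0.51, 0.405,
             0.4, 0.29, 0.38, 0.27, 0.36, 0.25, 0.34, 0.23, 0.3, 0.001]

def pairwise_auc(y_true, y_score, pos_label):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(y_score, y_true) if y == pos_label]
    neg = [s for s, y in zip(y_score, y_true) if y != pos_label]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(pairwise_auc(y_true, old_score, pos_label=2))  # 0.68
print(pairwise_auc(y_true, new_score, pos_label=2))  # 0.745
```

The second value, about 0.745, is consistent with the 0.74 quoted above for the modified classifier.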

In this case, we can say that, in general, your new classifier outperforms the previous one.

[Figure: ROC curve for classifier #2]

Which threshold to choose to put into production?

A common choice is the Kolmogorov–Smirnov (KS) statistic: pick the threshold where the value of (TPR - FPR) is maximized.
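Applied to the first classifier's scores, the (TPR - FPR) criterion can be sketched as follows. The helper names `rates_at` and `gap` are made up for this post, and the search simply tries every distinct score as a threshold.

```python
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
           0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]
y_true = [2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1]

POSITIVE = 2
n_pos = sum(y == POSITIVE for y in y_true)
n_neg = len(y_true) - n_pos

def rates_at(threshold):
    """TPR and FPR when every sample scoring >= threshold is called positive."""
    tp = sum(s >= threshold and y == POSITIVE for s, y in zip(y_score, y_true))
    fp = sum(s >= threshold and y != POSITIVE for s, y in zip(y_score, y_true))
    return tp / n_pos, fp / n_neg

def gap(threshold):
    """The quantity to maximize: TPR - FPR."""
    tpr, fpr = rates_at(threshold)
    return tpr - fpr

# Try every distinct score as a candidate threshold, highest first.
best_threshold = max(sorted(set(y_score), reverse=True), key=gap)
best_tpr, best_fpr = rates_at(best_threshold)

print(best_threshold, best_tpr, best_fpr)  # 0.54 0.5 0.1
```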

This criterion is unbiased in the sense that it gives TPR and FPR equal weight. But most of the time, your classifier will have a preference.

For example, an email spam classifier would prefer a low False Positive Rate, because a higher FPR means your classifier would be more likely to mistakenly filter out non-spam emails (e.g. important emails from your boss). In that case, you need to tailor your own decision function, for example, by giving FPR more weight.

A malicious URL classifier would prefer higher recall (TPR), simply because you don’t want to let a bad URL pass by, even though higher recall usually comes with a higher FPR. It is a matter of trade-off.
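One possible way to encode such a preference is to maximize TPR - w * FPR, where the weight w is something you choose. This weighted score and the values w = 5.0 and w = 0.2 are assumptions made for illustration, not a standard recipe. The sketch reuses the first classifier's scores.

```python
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
           0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]
y_true = [2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1]

POSITIVE = 2
n_pos = sum(y == POSITIVE for y in y_true)
n_neg = len(y_true) - n_pos

def best_threshold(fpr_weight):
    """Threshold maximizing TPR - fpr_weight * FPR (a hypothetical decision function)."""
    def utility(threshold):
        tp = sum(s >= threshold and y == POSITIVE for s, y in zip(y_score, y_true))
        fp = sum(s >= threshold and y != POSITIVE for s, y in zip(y_score, y_true))
        return tp / n_pos - fpr_weight * fp / n_neg
    return max(set(y_score), key=utility)

# Spam-filter style: false positives are expensive, so weight FPR heavily.
print(best_threshold(fpr_weight=5.0))  # 0.8

# Malicious-URL style: missing a positive is expensive, so weight FPR lightly.
print(best_threshold(fpr_weight=0.2))  # 0.3
```

With a heavy FPR weight the chosen threshold is strict (0.8, no false positives on this data). With a light FPR weight it is lenient (0.3, every positive is caught at the cost of a high FPR).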