A Quick Look at Classification Metrics

Richard Mei
3 min read · Oct 5, 2020


Regression problems use metrics like MAE, MSE, and RMSE. Classification models, on the other hand, are evaluated with a few different terms:

  • Accuracy: overall, how often is this classifier correct?
  • Precision: of all the reported positives, how many are actually positive (true positives over total reported positives)? Use this when you want to be confident that a predicted positive really is positive, i.e. when false positives are the more costly mistake.
  • Sensitivity (Recall): the proportion of actual positives identified correctly. Of all the cases that are actually positive, how many did the model catch? Choose this metric if you are more okay with false positives. (Prefer false positives over false negatives.)
  • Specificity: the proportion of actual negatives identified correctly. I usually think of it as recall, but for the negative cases. Use this metric if you are more okay with false negatives. (Prefer false negatives over false positives.) All four metrics are computed in the sketch after this list.
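
Here is a minimal sketch of how the four metrics fall out of a confusion matrix, using scikit-learn's confusion_matrix; the y_true and y_pred arrays are made-up labels purely for illustration:

```python
# Minimal sketch: the four metrics from a binary confusion matrix.
# y_true / y_pred are made-up example labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall, how often correct
precision   = tp / (tp + fp)                   # of reported positives, how many are real
recall      = tp / (tp + fn)                   # of actual positives, how many were caught
specificity = tn / (tn + fp)                   # of actual negatives, how many were caught

print(accuracy, precision, recall, specificity)
```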

To remember these, I like to use the story of the boy who cried wolf, keeping in mind that we are looking at things from the boy's point of view, the way the model reports results to us. Recall is: of the times there actually is a wolf, how many times did the boy cry wolf? Specificity is: of the times there is no wolf, how many times did the boy not cry wolf? It's not a perfect example, but for some reason it helps me remember, so hopefully it'll help you as well.

The metric you choose depends on what you are testing for, so no single metric is the best one. Generally, if the classes are symmetrical (balanced) and false positives and false negatives are equally weighted, accuracy is a fine metric. On the other hand, if the classes are imbalanced or you prefer one kind of error over the other, you would not want accuracy; you want one of the others. In that case, another good metric is the F1 score, since it is the harmonic mean of precision and recall.
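
As a quick check of that harmonic-mean claim, here is a minimal sketch; the precision and recall values are made up for illustration:

```python
# Minimal sketch: F1 is the harmonic mean of precision and recall.
# These precision/recall values are made up for illustration.
precision, recall = 0.80, 0.60

f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.686 -- note it sits closer to the lower of the two values
```

Because the harmonic mean is pulled toward the smaller value, a model cannot score a high F1 by excelling at only one of precision or recall.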

Relationship of Precision and Recall:

Now that we know what precision and recall are, it’s time to go a little more in depth. There is a ‘tug of war’ between precision and recall: if you aim to raise or lower one, the other will adjust inversely. You adjust the two metrics by changing the threshold, the cutoff value that draws the line of classification. For example, if you are doing a logistic regression and your predictions come out as 100 negatives and 10 positives, adjusting the threshold might change that to 95 negatives and 15 positives. This would change your precision and recall inversely, as stated above; the sketch below makes the trade-off concrete.
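
Here is a minimal sketch of that threshold adjustment on a synthetic, imbalanced dataset (make_classification and the 0.5 / 0.3 thresholds are arbitrary stand-ins for whatever data and cutoffs you actually have):

```python
# Minimal sketch: moving the decision threshold trades precision against recall.
# The synthetic dataset and the 0.5 / 0.3 thresholds are arbitrary illustrations.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)[:, 1]  # predicted probability of the positive class

for threshold in (0.5, 0.3):
    preds = (probs >= threshold).astype(int)
    print(threshold,
          round(precision_score(y, preds), 3),
          round(recall_score(y, preds), 3))
# Lowering the threshold flags more points as positive:
# recall typically goes up while precision goes down.
```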

You may be asking how to find the right balance. There is a way to see this trade-off across all thresholds without actually changing the threshold by hand, and that is by looking at the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the ROC Curve).

  • ROC: a graph of sensitivity (true positive rate) vs. the false positive rate. This graph summarizes all of the confusion matrices that the different thresholds produce.
  • AUC: helps compare two different ROC curves. You want the area under the curve to be closer to 1; in other words, you want the higher AUC. Both are shown in the sketch below.
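
A minimal sketch of both, assuming y and probs are the true labels and predicted probabilities from the logistic regression sketch above (roc_curve sweeps the thresholds for you):

```python
# Minimal sketch: the ROC curve sweeps every threshold at once; AUC summarizes it.
# `y` and `probs` are assumed to come from the logistic regression sketch above.
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y, probs)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y, probs)               # closer to 1.0 is better

print(round(auc, 3))
```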

Overall, that was a quick look at some classification metrics. Hopefully this served as a good introduction!
