Accuracy is the archetypal metric in machine learning.
It’s a metric for evaluating model performance for classification tasks and is so well-known that it is often used as a synonym for overall offline and online model performance.
Accuracy achieved this unique status as it is one of the—if not the—easiest metrics to interpret and implement in ML. It provides a clear answer, appreciated by all stakeholders, to the question “How often is the classifier correct?” This simplicity, however, comes at the cost of only being applicable to limited use cases.
In this post, we’ll dive into the details of this crucial metric by providing a comprehensive definition, a review of its importance, and a walk-through of when to use it, and when not to, for evaluating model performance.
AI accuracy is the percentage of correct classifications that a trained machine learning model achieves, i.e., the number of correct predictions divided by the total number of predictions across all classes. It is often abbreviated as ACC.
ACC is reported as a value in the range [0, 1] or [0, 100], depending on the chosen scale. An accuracy of 0 means the classifier always predicts the wrong label, whereas an accuracy of 1, or 100, means that it always predicts the correct label.
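As a quick illustration, here is a minimal sketch in plain Python that computes accuracy as the fraction of matching predictions; the labels below are invented purely for the example:

```python
# Minimal sketch: accuracy = correct predictions / total predictions.
# Labels and predictions are invented for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"Accuracy: {accuracy:.2f}")  # 0.80, i.e., 80%
```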
A nice characteristic of this metric is that it has a direct relationship with all values of the confusion matrix. These are the four pillars of supervised machine learning evaluation: true positives, false positives, true negatives, and false negatives.
Starting from the confusion matrix, we can see this relationship by deriving the statistical formula for accuracy. Note that we do so on binary classification for simplicity, but the same concept can be easily extended to more than two classes.
Accuracy is a proportional measure of the number of correct predictions over all predictions.
Correct predictions are composed of true positives (TP) and true negatives (TN).
All predictions are composed of the entirety of positive (P) and negative (N) examples.
P is composed of TP and false negatives (FN), and N is composed of TN and false positives (FP).
Thus, we can define accuracy as ACC = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / (P + N).
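To make the relationship with the confusion matrix concrete, here is a short sketch, assuming scikit-learn is available, that derives accuracy from the four confusion-matrix values and checks it against the library’s own accuracy_score (the toy labels are invented for illustration):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy binary labels, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc_from_cm = (tp + tn) / (tp + tn + fp + fn)

assert acc_from_cm == accuracy_score(y_true, y_pred)
print(f"ACC = ({tp} + {tn}) / ({tp} + {tn} + {fp} + {fn}) = {acc_from_cm:.2f}")  # 0.75
```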
It is also important to emphasize that, as with any evaluation metric, model accuracy should be measured on a statistically significant number of predictions.
One of the main reasons why model accuracy is an important metric, as previously highlighted, is that it is an extremely simple indicator of model performance. We have not mentioned yet that it is also a simple measure of model error. In fact, accuracy can be seen as (1 – error).
In both of its forms, accuracy is a particularly efficient and effective metric for evaluating machine learning predictions. It is one of the most used metrics in research, where clean and balanced datasets are common, allowing the focus to stay on advancements in the algorithmic approach.
Accuracy can be useful for real-life applications too, when datasets with similar characteristics are available. Thanks to its clear interpretation, model accuracy can be easily matched with a variety of business metrics such as revenue and cost; this simplicity makes it easier to report on the value of the model to all stakeholders, which improves the chances of success for an ML initiative.
Accuracy is a good metric for balanced classification tasks. A classification task is balanced when all classes appear in comparably equal numbers. The easiest way to understand why this is the case is to look at an example of an imbalanced classification task.
Let’s take a fraud engine with a 0.03% fraud rate that needs to flag transactions as fraudulent or not fraudulent. If all transactions are classified as not fraudulent, prediction accuracy is 99.97%. Model performance seems to be almost perfect, but the classifier is actually useless, as it does not flag any real fraudulent transaction.
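This failure mode is easy to reproduce. Below is a small sketch, with the dataset size invented to match the 0.03% fraud rate of the example, where a classifier that flags nothing still scores 99.97% accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Simulate 100,000 transactions with a 0.03% fraud rate (30 fraud cases);
# the numbers are invented to mirror the example above.
n = 100_000
y_true = np.zeros(n, dtype=int)
y_true[:30] = 1  # the 30 fraudulent transactions

# A "classifier" that labels every transaction as not fraudulent.
y_pred = np.zeros(n, dtype=int)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4%}")  # 99.9700%
```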
For imbalanced tasks, alternative metrics such as precision and recall can be used. These metrics are also derived from the confusion matrix, as precision = TP / (TP + FP) and recall = TP / (TP + FN). The two metrics are in tension, in the sense that improving one typically reduces the other.
It is thus common to look at metrics that combine them, particularly their harmonic mean, called the F1-score, which is often referred to as the “accuracy for imbalanced classes.” We recommend looking at “A Review on Evaluation Metrics for Data Classification Evaluations” by Hossin and Sulaiman for a general overview of classification metrics.
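Continuing the fraud sketch above, these metrics immediately expose the useless classifier: with zero true positives, precision, recall, and therefore the F1-score all drop to zero, even though accuracy is 99.97%:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Same invented setup as the fraud sketch above: 30 fraud cases in 100,000.
y_true = np.zeros(100_000, dtype=int)
y_true[:30] = 1
y_pred = np.zeros(100_000, dtype=int)  # never flags fraud

# zero_division=0 suppresses the warning raised when TP + FP == 0.
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00
print(f"Recall:    {recall_score(y_true, y_pred, zero_division=0):.2f}")     # 0.00
print(f"F1-score:  {f1_score(y_true, y_pred, zero_division=0):.2f}")         # 0.00
```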
Accuracy has a direct effect on deployment too, or online performance.
Machine learning prediction accuracy aims to give a good idea of how well a model will perform at predicting on unseen data samples. If a model achieves higher-than-threshold offline performance, it is typically considered ready for deployment.
It is often the case that a model’s online performance changes over time as the behavior underlying the data itself evolves. Also, different from offline performance evaluation, measuring the performance of a deployed model requires accommodating for a lag since labels are not immediately available on live inputs.
In the few cases when labels are available in a timely manner, online evaluation is typically set up as a recurrent offline evaluation of the latest labeled live data, with relevant metrics reported in dashboards and alerts set up. Most often, online data is difficult to label, and the online evaluation of the model’s accuracy is measured via statistical metrics to catch drift.
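As a rough illustration of the first setup, here is a hedged sketch of such a recurring evaluation step; fetch_latest_labeled_batch and send_alert are hypothetical callables standing in for your data access and alerting layers, and the threshold is invented:

```python
from sklearn.metrics import accuracy_score

ACCURACY_ALERT_THRESHOLD = 0.90  # invented threshold, set per use case


def evaluate_latest_batch(fetch_latest_labeled_batch, send_alert):
    """Recompute accuracy on the most recent labeled live data.

    Both arguments are hypothetical callables: the first returns
    (y_true, y_pred) for the latest labeled window, the second
    notifies dashboards or on-call staff.
    """
    y_true, y_pred = fetch_latest_labeled_batch()
    acc = accuracy_score(y_true, y_pred)
    if acc < ACCURACY_ALERT_THRESHOLD:
        send_alert(f"Online accuracy dropped to {acc:.2%}")
    return acc
```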
Continuously keeping track of accuracy for the online model within your monitoring system is fundamental to detecting and acting on model staleness, as well as data and concept drift, and continuously ensuring optimized model performance.
While it may seem like the ideal goal when developing a model would be to achieve 100% model accuracy, in practice a perfect score is a warning sign rather than a cause for celebration.
Achieving 100% machine learning model accuracy is typically a sign of some error, such as overfitting; that is, the model learns the characteristics of the training set so specifically that it cannot generalize to unseen data in the validation and evaluation sets. It can also be a sign of a logical bug or data leakage, which is when the feature set contains information about the label that would not be available at prediction time.
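A quick sanity check for both failure modes is to compare training accuracy against held-out accuracy: a perfect training score with a much lower validation score points to overfitting, while perfect scores on both warrants a hunt for leakage. A minimal sketch on synthetic data, assuming scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated purely for illustration.
X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# An unconstrained tree can memorize the training set outright.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # typically 1.00 here
val_acc = model.score(X_val, y_val)        # noticeably lower
print(f"Train: {train_acc:.2f}  Validation: {val_acc:.2f}")
if train_acc == 1.0 and val_acc < train_acc:
    print("Suspicious gap: check for overfitting or data leakage.")
```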
This is because complex systems are inherently underdetermined, making perfect prediction accuracy practically impossible. If you can fully determine the system and achieve 100% accuracy, then you should not be using machine learning in the first place; classical modeling or a heuristic is the better tool.
While it may be impossible to achieve 100% accuracy, understanding what AI accuracy represents, together with how and when to use it as a metric, can make a real difference in making your machine learning initiatives successful.
In fact, we recommend adopting it as one of the evaluation metrics for any initiative that can be modeled as a balanced classification task.
When tracking metrics for offline experiments and online evaluation, Iguazio brings your data science to life with a production-first approach that can boost your model accuracy throughout the model lifecycle, ensuring the model keeps performing via automated model monitoring and automated training and evaluation pipelines.