Reliably evaluating machine learning models offline before pushing them into production is fundamental for the success of any data science initiative.
A robust offline evaluation phase ensures that the production model will offer value and quality to end users without unexpected behaviors and bias. Offline evaluation also provides the basis for making educated forecasts on the expected return on investment (ROI).
The most common failure point when performing offline evaluation is to test the model on historical data which is not representative of the online data that the model will predict on at serving time. Cross-validation is a technique that aims to minimize this risk while also providing a simple and accurate model performance evaluation process. This is especially valuable when using small datasets.
This article presents an introduction to cross-validation, an overview of its benefits, and a walk through of when and how to use this technique in machine learning.
Cross-validation is a statistical technique used by data scientists to train and evaluate machine learning models, with a focus on obtaining reliable estimates of model performance.
To understand how cross-validation supports model development, we first need to understand how data is used to train and evaluate models.
After performing data processing and feature engineering, it is typical to divide the available historical dataset into three different splits: a training set, a validation set, and a test set.
Each data sample should appear in only one of the sets to avoid information leakage, and each set should be of a statistically significant size. Most commonly, a 70/10/20 split (training/validation/test) is used for large datasets, with the majority of data allocated to the training set.
When not performing hyperparameter tuning, practitioners often prefer to skip the validation set so that all available data can be used for training and evaluation. In this approach, called the holdout approach, only training and test sets are used, with data samples split randomly between the two.
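As a rough sketch of these splits using scikit-learn (the toy dataset, proportions, and variable names below are illustrative assumptions, not a prescribed recipe):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy dataset standing in for the historical data (illustrative only).
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

# 70/10/20 split: carve off 20% for the test set, then take 10% of the
# original total (0.125 of the remaining 80%) for the validation set.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.125, random_state=42)

# Holdout approach: a single random train/test split, no validation set.
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(X, y, test_size=0.20, random_state=42)
```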
The cross-validation process builds upon the standard data splitting approaches outlined above. It feeds data to the machine learning algorithm for training and evaluation using a two-step approach: first, the data is divided into several equally sized subsets, or folds; then, the model is repeatedly trained on all folds but one and evaluated on the held-out fold, rotating until every fold has served as the test set once.
The results are then averaged across these multiple training and evaluation runs to determine the model’s overall performance.
Cross-validation thus reduces the risk, inherent in the holdout approach, of landing on a particularly lucky or unlucky split of the data into training and test sets, since performance is estimated over several different splits rather than a single one.
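A minimal sketch of this loop, assuming a scikit-learn classifier and accuracy as the metric (both arbitrary choices made for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Step 1: train on every fold except the one held out for this iteration.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # Step 2: evaluate on the held-out fold.
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Average across the runs to estimate overall performance.
print(f"mean accuracy: {np.mean(scores):.3f} (std {np.std(scores):.3f})")
```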
When referring to the cross-validation technique, it is common to also use the terms out-of-sample testing or rotation estimation, both of which allude to its underlying process.
While some also use k-fold cross-validation as a synonym, this is actually a specific implementation of the cross-validation method where k is equal to the number of folds.
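In scikit-learn, for instance, k-fold cross-validation reduces to a single helper call; the value of k, the model, and the dataset below are assumptions made for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# k-fold cross-validation with k = 10: the data is split into 10 folds and the
# model is trained and scored 10 times, once per held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(len(scores), scores.mean())  # 10 scores, one per fold
```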
Many data scientists would agree that cross-validation is one of the most helpful techniques to use when developing machine learning models, as it yields accurate and reliable estimates of model performance. Cross-validation has the additional benefit of being easy to understand and implement.
The advantages of using cross-validation in machine learning stem from the fact that the technique provides multiple training and test data splits. This allows data scientists to see how performance varies across different subsets of the data, obtain a more robust estimate of how the model will behave on unseen data, and detect problems such as overfitting.
These benefits are particularly relevant for small datasets as all available data is used both for training and evaluation.
The most important downside worth noting is that this technique is necessarily more expensive than a single holdout evaluation in both time and compute resources, because it runs training and evaluation multiple times, once per split.
While cross-validation in machine learning can be applied to a variety of learning tasks, it is most commonly used for supervised learning.
As long as sufficient time and compute resources are available (which is often the case when cross-validating small datasets) and the data can be randomly split into folds, you can use this technique when prototyping any machine learning model to reliably estimate its performance, compare candidate models, and tune hyperparameters.
Also, specific implementations of cross-validation exist to handle particular data characteristics, such as imbalanced classes, grouped samples, or time-ordered observations.
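scikit-learn, for example, ships splitters tailored to such cases; the three shown below are an illustrative rather than exhaustive selection, applied to a small toy dataset:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)          # toy feature column
y = np.array([0] * 15 + [1] * 5)          # imbalanced labels
groups = np.repeat(np.arange(5), 4)       # e.g., five distinct users

# Preserve class proportions in every fold (helpful for imbalanced labels).
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    pass  # each test fold keeps roughly the overall 3:1 class ratio

# Keep all samples from the same group in a single fold (no group in both sets).
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    pass

# Respect temporal ordering: each split trains on the past and tests on the future.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    pass
```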
A large variety of machine learning validation techniques and implementations are available to allow practitioners to begin performing cross-validation immediately.
Cross-validation techniques can be divided into exhaustive approaches, which train and evaluate the model on every possible way of splitting the data (for example, leave-one-out and leave-p-out), and non-exhaustive approaches, which only consider a subset of the possible splits (for example, k-fold and repeated random subsampling).
Exhaustive approaches should be used for small datasets only, as it can be extremely expensive to run model training and evaluation for every possible split of the data.
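Leave-one-out cross-validation is the classic exhaustive variant: with n samples it performs n training runs, one per held-out sample. A rough comparison against plain k-fold, assuming a small toy dataset and an arbitrary classifier:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 samples

# Exhaustive: leave-one-out trains one model per sample, i.e. 150 runs here.
loo_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())

# Non-exhaustive: plain k-fold only trains k models (here k = 5).
kfold_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(len(loo_scores), loo_scores.mean())      # 150 runs
print(len(kfold_scores), kfold_scores.mean())  # 5 runs
```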
The most commonly used cross-validation implementations for these techniques (and more besides) are provided by the sklearn library as well as popular frameworks such as xgboost and lightgbm; most data science libraries have built-in capabilities for model validation.
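For instance, xgboost exposes a built-in cv utility that runs k-fold cross-validation during boosting; the parameter values below are illustrative assumptions rather than recommended settings:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eval_metric": "logloss", "max_depth": 3}

# 5-fold cross-validation over 50 boosting rounds; returns per-round mean and
# standard deviation of the metric on the training and test folds.
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, seed=0)
print(cv_results.tail())
```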
When implementing and running cross-validation as well as more generally prototyping machine learning models, it is fundamental to have processes set up for experiment tracking in order to avoid repetition, reduce human error, and provide full traceability.
MLRun is a powerful tool for experiment tracking with any framework, library, and use case—including cross-validation—which can be deployed as part of Iguazio’s MLOps platform for end-to-end data science.