The machine learning lifecycle is an iterative, multidirectional process composed of three main phases:
In this lifecycle, the second phase is the most experimental. Here, data scientists perform feature engineering to transform the collected raw data into the representation best suited for model learning. Model training can then begin.
Feature engineering and model training are intertwined and iterative, but model training can be seen as the pivotal step of the machine learning model development process. This is because the ultimate aim of feature engineering is to make model training as effective as possible.
This article presents an introduction to model training, a discussion of its importance, a walk-through of how to train machine learning (ML) models during experimentation, and a conclusion on productionizing model retraining.
Model training is the process of feeding engineered data to a parametrized machine learning algorithm in order to output a model with optimal learned trainable parameters that minimize an objective function.
Let’s dissect the different parts of this definition:
Model training happens over multiple consecutive iterations in which the training data, divided into batches that typically hold between 32 and 1024 examples, is fed to the algorithm multiple times. This allows the algorithm to learn the data’s underlying patterns.
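To make the definition concrete, here is a minimal sketch of a training loop, assuming a one-feature linear model, mean squared error as the objective function, and mini-batch gradient descent as the optimizer. The function name, the toy data, and the default settings are illustrative choices, not part of any particular library.

```python
import random

def train_linear_model(xs, ys, batch_size=32, epochs=200, learning_rate=0.05, seed=0):
    """Fit y ~ w * x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # trainable parameters, updated on every batch
    indices = list(range(len(xs)))
    for _ in range(epochs):  # the data is fed to the algorithm multiple times
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the batch mean squared error with respect to w and b
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 3x + 1, so the learned parameters should approach 3 and 1
xs = [i / 100 for i in range(200)]
ys = [3 * x + 1 for x in xs]
w, b = train_linear_model(xs, ys)
```

In practice, a machine learning framework would run this loop for us; the sketch only shows how batches, parameters, and the objective function fit together.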
Machine learning is a discipline at the intersection of artificial intelligence and computer science. We use terminology and concepts from the latter to understand ML.
Model training aims to build the best mathematical representation of the relationship between data and a target (supervised) or among the data itself (unsupervised).
Metrics such as accuracy quantify how well the model has learned this representation; that is, they report the model’s performance. The better the model performs, the more benefit it brings when used in real life. These benefits could include increased revenue, reduced costs, or improved user experience.
Investing time and resources for optimal model training means having access to the right expertise and an appropriate engineering backbone setup within a production-first approach to ML. Such an investment can prove a real differentiator for business success. In fact, leading ML-driven businesses achieve 44% higher productivity and 40% better customer experience—among other gains—than their counterparts.
The process of training ML models can be divided into four steps.
The training data set is used for model training, and the evaluation set for evaluating the performance of the trained model. It is essential that the two sets do not overlap, so that no data in the evaluation set has been seen during training and the performance estimate remains unbiased.
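A minimal sketch of such a split follows, assuming the data fits in memory as a list of rows and that a random 80/20 split is appropriate. The helper name `split_dataset` and the 20% evaluation fraction are illustrative assumptions; time-ordered or grouped data would need a different strategy.

```python
import random

def split_dataset(rows, eval_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into disjoint training and evaluation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

rows = list(range(100))  # stand-in for 100 data rows
train_rows, eval_rows = split_dataset(rows)
```

Because both sets are slices of the same shuffled list, no row can end up in both.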
First, we should select a baseline: a heuristic, or an algorithm simpler than the ones we intend to train, against which to compare the final trained model’s performance.
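As an illustration, one common heuristic baseline for classification is to always predict the most frequent label in the training data. The sketch below assumes a small spam/ham labeling task; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def fit_majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority_label

def accuracy(predict, feature_rows, labels):
    """Fraction of rows for which the prediction matches the true label."""
    correct = sum(predict(row) == label for row, label in zip(feature_rows, labels))
    return correct / len(labels)

train_labels = ["ham"] * 7 + ["spam"] * 3
eval_rows = [[0.1], [0.9], [0.4], [0.3]]  # features are ignored by this baseline
eval_labels = ["ham", "spam", "ham", "ham"]

baseline = fit_majority_baseline(train_labels)
baseline_accuracy = accuracy(baseline, eval_rows, eval_labels)
```

A trained model that cannot beat this number on the same evaluation set is not yet adding value.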
Then, it is common to select multiple algorithms for training, unless one specific algorithm is clearly the best fit for the use case and data. The most appropriate algorithm(s) to deploy depend on training and inference speed, costs, data size and type, available infrastructure, and desired offline performance.
Some of the most common machine learning modeling techniques are:
* For deep learning, there is a follow-up phase of “model architecture development” to define the exact layers—optionally on top of pretrained networks—to be used for the final neural network model.
Each algorithm has a set of default hyperparameters, which are unlikely to be the best-performing choice for a given use case and data set. To get the most out of each algorithm, we perform hyperparameter tuning on a subset of the data before training the final model on the complete data set.
During tuning, we should also set aside a validation set on which to compare the different hyperparameter selections, so that the evaluation set remains unseen until the final model evaluation.
This is the process of fitting the tuned algorithm to the training data.
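The sketch below shows the idea with a grid search over a single hyperparameter, assuming a tiny one-dimensional k-nearest-neighbors classifier where `k` is the value being tuned. The data points, candidate values, and function names are made up for illustration; the evaluation set is deliberately never touched.

```python
from collections import Counter

def knn_predict(train_points, k, x):
    """Predict the majority label among the k training points nearest to x."""
    neighbors = sorted(train_points, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def accuracy(train_points, k, labeled_points):
    hits = sum(knn_predict(train_points, k, x) == y for x, y in labeled_points)
    return hits / len(labeled_points)

def tune_k(train_points, validation_points, candidates):
    """Score every candidate k on the validation set and return the best one."""
    scores = {k: accuracy(train_points, k, validation_points) for k in candidates}
    best_k = max(scores, key=scores.get)  # ties resolve to the earliest candidate
    return best_k, scores

# (feature, label) pairs; the labels at 0.3 and 0.7 are deliberately noisy
train_points = [(0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0),
                (0.6, 1), (0.7, 0), (0.8, 1), (0.9, 1)]
validation_points = [(0.22, 0), (0.33, 0), (0.68, 1), (0.77, 1)]

best_k, scores = tune_k(train_points, validation_points, candidates=[1, 3, 5])
```

Here `k=1` overfits the noisy labels, while larger values smooth them out. Only after `best_k` is fixed would the final model be scored once on the held-out evaluation set.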
End-to-end model training is a highly experimental process that requires many iterations. For each selected algorithm, we can expect to repeat steps 3 and 4 multiple times, and to frequently update the feature set provided as input. This is why a robust and user-friendly experiment tracking process that is systematic, repeatable, and reproducible, such as one built on a tool like MLRun, is so important for the success of data science initiatives.
In production, we can expect to retrain the model periodically as new data comes in, to limit the impact of concept drift on model performance. Model retraining is best automated to run on a schedule, and possibly on a trigger, within the monitored end-to-end production system.
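As a rough sketch of how a schedule and a trigger can be combined, the function below decides whether to retrain based on model age and on a monitored accuracy drop. The function name, the seven-day schedule, and the 0.05 accuracy threshold are hypothetical values chosen for illustration; a real system would take them from its own monitoring setup.

```python
from datetime import datetime, timedelta

def should_retrain(last_trained_at, now, live_accuracy, reference_accuracy,
                   max_age=timedelta(days=7), max_accuracy_drop=0.05):
    """Return True when the schedule is due or monitored performance has degraded."""
    schedule_due = now - last_trained_at >= max_age
    performance_degraded = reference_accuracy - live_accuracy > max_accuracy_drop
    return schedule_due or performance_degraded

last_trained_at = datetime(2024, 1, 1)

# Two days after training with stable accuracy: no retraining needed
stable = should_retrain(last_trained_at, datetime(2024, 1, 3), 0.89, 0.90)

# Two days after training, but accuracy has dropped sharply: the trigger fires
degraded = should_retrain(last_trained_at, datetime(2024, 1, 3), 0.80, 0.90)

# Eight days after training: the schedule fires even though accuracy is stable
scheduled = should_retrain(last_trained_at, datetime(2024, 1, 9), 0.89, 0.90)
```

The accuracy check assumes that ground-truth labels become available in production; when they do not, a drift statistic computed on the input data could play the same role.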
A production-first approach aims to develop the infrastructure for the complete model lifecycle first, and then to push models into production quickly within an agile process. This kind of approach can accelerate the end-to-end data science process by up to 12x. No surprise, then, that a production-first approach is the new paradigm for model training and prototyping.