Correctly Validating Machine Learning Models

Calculating model accuracy is a critical part of any machine learning project, yet many data science tools make it difficult or impossible to assess the true accuracy of a model. Often tools only validate the model selection itself, not what happens around the selection. Or worse, they don’t support tried and true techniques like cross-validation.

Every data scientist has been in a situation where a machine learning model looks like it will do a great job of predicting something, but once it’s in production, it doesn’t perform as well as expected. In the best case, this is only an annoying waste of time. But in the worst case, a model performing unexpectedly can cost millions of dollars – or potentially even human lives!

So was the predictive model wrong in those cases? Possibly. But often it is not the model which is wrong, but how the model was validated. A wrong validation leads to over-optimistic expectations of what will happen in production.

Basics of Predictive Models and Validation

Before we dive deeper into the components of a correct validation without any surprises, this section describes the basic concepts of the validation of machine learning models.

What Models Can Be Validated?

A predictive model is a function which maps a given set of values for the x-columns to the correct corresponding value for the y-column. Finding such a function for a given data set is called training the model.

Good models not only avoid errors for x-values they already know. More importantly, they can also make predictions for situations which are only somewhat similar to the situations already stored in the existing data table. This ability to generalize from known situations to unknown future situations is the reason we call this particular type of model predictive.

Training vs. Test Error

One thing is true for all machine learning methods, whether it is a decision tree or deep learning: you want to know how well your model will perform. You do this by measuring its accuracy.

Why? First of all, because measuring a model’s accuracy can guide you to select the best-performing algorithm for it and fine-tune its parameters so that your model becomes more accurate. But most importantly, you will need to know how well the model performs before you use it in production. If your application requires the model to be correct for more than 90% of all predictions but it only delivers correct predictions 80% of the time, you might not want the model to go into production at all.

So how can we calculate the accuracy of a model? The basic idea is that you can train a predictive model on a given dataset and then use that underlying function on data points where you already know the value for y.

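Applying the model to those data points gives a prediction p for each of them. It is now relatively easy to calculate how often our predictions are wrong by comparing the predictions in p to the true values in y – the fraction of wrong predictions is called the classification error.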
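
As a minimal sketch of this calculation, the classification error is simply the fraction of predictions that differ from the true values. The function name and the toy labels below are assumptions made for illustration; they do not come from any particular tool.

```python
def classification_error(p, y):
    """Fraction of predictions in p that differ from the true values in y."""
    if len(p) != len(y):
        raise ValueError("p and y must have the same length")
    if not y:
        raise ValueError("need at least one data point")
    wrong = sum(1 for pred, true in zip(p, y) if pred != true)
    return wrong / len(y)


# Hypothetical true values y and model predictions p for five data points.
y = ["yes", "no", "yes", "yes", "no"]
p = ["yes", "no", "no", "yes", "yes"]

error = classification_error(p, y)
print(error)  # 2 of the 5 predictions are wrong
```

The accuracy is the complement of this value, i.e. 1 minus the classification error.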

There are two important concepts used in machine learning: the training error and the test error.

  • Training error. We get this by calculating the classification error of a model on the same data the model was trained on.
  • Test error. We get this by using two completely disjoint datasets: one to train the model and the other to calculate the classification error. Both datasets need to have values for y. The first dataset is called training data and the second, test data.
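
The difference between the two errors can be illustrated with a small sketch. Everything here is an assumption chosen for illustration: the toy data (one x-column, a noisy two-class y-column), the 70/30 split, and the use of a 1-nearest-neighbor model, which simply memorizes its training rows.

```python
import random


def classification_error(p, y):
    """Fraction of predictions in p that differ from the true values in y."""
    return sum(pred != true for pred, true in zip(p, y)) / len(y)


def predict_1nn(train_rows, x):
    """Return the label of the training row whose x-value is closest to x."""
    nearest_x, nearest_label = min(train_rows, key=lambda row: abs(row[0] - x))
    return nearest_label


def error_on(train_rows, eval_rows):
    """Classification error of a 1-NN model built on train_rows, measured on eval_rows."""
    p = [predict_1nn(train_rows, x) for x, _ in eval_rows]
    y = [label for _, label in eval_rows]
    return classification_error(p, y)


rng = random.Random(0)

# Toy data: the label is "high" for x > 0.5, with about 20% of labels flipped as noise.
rows = []
for _ in range(200):
    x = rng.random()
    label = "high" if x > 0.5 else "low"
    if rng.random() < 0.2:
        label = "low" if label == "high" else "high"
    rows.append((x, label))

# Two completely disjoint datasets: 140 rows for training, 60 rows for testing.
rng.shuffle(rows)
train_rows, test_rows = rows[:140], rows[140:]

train_error = error_on(train_rows, train_rows)  # same data the model was built on
test_error = error_on(train_rows, test_rows)    # data the model has never seen

print("training error:", train_error)
print("test error:", test_error)
```

Because every training row is its own nearest neighbor (assuming distinct x-values), the training error of this model is 0 no matter how noisy the labels are. Only the test error says anything about how the model handles rows it has not seen.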

Next let’s look at the process to calculate the test error. It will soon be apparent why it is so important that the datasets to calculate the test error are completely disjoint (i.e., no data point used in the training data should be part of the test data and vice versa).

Cross-Validation as the Gold Standard

Using a hold-out data set from your training data in order to calculate the test error is an excellent way to get a much more reliable estimate of the future accuracy of a model. But still there is a problem: how do we know that the hold-out set was not particularly easy for the model? It could be that the random sample you selected is not so random after all, especially if you only have small training data sets available. You might end up with all the tough data rows for building the model and the easy ones for testing – or the other way round. In both cases your test error might be less representative of the model accuracy than you think.

One idea might be to just repeat the sampling of a hold-out set multiple times and use different samples each time for the hold-out set. For example, you might create 10 different hold-out sets and 10 different models on the remaining training data sets. And in the end you can just average those 10 different test errors and will end up with a better estimate which is less dependent on the actual sample of the test set. This procedure has a name – repeated hold-out testing. It was the standard way of validating models for some time, but nowadays it has been replaced by a different approach.

Although in principle the average of the test errors on the repeated hold-out sets is superior to a single test error on any particular test set, it still has one drawback: some data rows end up in several of the test sets while other rows are never used for testing at all. As a consequence, the errors you make on those repeated rows have a higher impact on the test error, which is just another form of selection bias. Hmm… what’s a good data scientist to do?
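
This drawback can be made visible by counting how often each row lands in a hold-out set. The sketch below is only an illustration under assumed sizes (100 rows, 10 repetitions, 10 rows held out each time); no model is trained, only the sampling is simulated.

```python
import random
from collections import Counter

n_rows = 100      # hypothetical data set size
test_size = 10    # rows held out per repetition
repetitions = 10  # number of independent hold-out samples

rng = random.Random(42)
test_counts = Counter()
for _ in range(repetitions):
    # Each hold-out set is drawn without replacement, but independently
    # of all the other hold-out sets.
    holdout = rng.sample(range(n_rows), test_size)
    test_counts.update(holdout)

never_tested = n_rows - len(test_counts)
tested_more_than_once = sum(1 for count in test_counts.values() if count > 1)

print("rows never used for testing:", never_tested)
print("rows used for testing more than once:", tested_more_than_once)
```

With these sizes, a given row misses a single hold-out set with probability 0.9, so in expectation roughly 35% of the rows (0.9 to the power of 10) are never tested, while others are counted several times.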

The answer: k-fold cross-validation

With k-fold cross-validation you don’t just draw multiple test samples repeatedly. Instead, you divide the complete dataset into k disjoint parts of the same size. You then train k different models, each on k-1 parts, and always test each model on the remaining part of the data. If you do this for all k parts exactly once, you ensure that you use every data row equally often for training (k-1 times) and exactly once for testing. And you still end up with k test errors which you can average, just as with the repeated hold-out sets discussed above.
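
The splitting step can be sketched in a few lines. This is a minimal illustration, not the implementation of any specific tool: the function name, the seed parameter, and the sizes are assumptions. When the number of rows is not divisible by k, the parts in this sketch differ in size by at most one row.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k disjoint folds over n_rows rows."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes into the same fold, so the folds are disjoint.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


n_rows, k = 20, 5
test_counts = [0] * n_rows
train_counts = [0] * n_rows

for train_idx, test_idx in k_fold_indices(n_rows, k):
    # A model would be trained on train_idx and its test error measured on test_idx.
    for j in test_idx:
        test_counts[j] += 1
    for j in train_idx:
        train_counts[j] += 1

print(set(test_counts))   # {1}: every row is tested exactly once
print(set(train_counts))  # {4}: every row is used for training k-1 times
```

In a full validation, the loop body would train one model per fold and collect its test error, and the k errors would then be averaged.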


  • In machine learning, training a predictive model means finding a function which maps a set of values x to a value y.
  • We can calculate how well such a model is doing by comparing the predicted values with the true values for y.
  • If we apply the model on the data it was trained on, we can calculate the training error.
  • If we calculate the error on data which was unknown in the training phase, we can calculate the test error.
  • You should never use the training error for estimating how well a model will perform. In fact, it is better to not care at all about the training error.
  • You can always build a hold-out set of your data not used for training in order to calculate the much more reliable test error.
  • Cross-validation is a perfect way to make full use of your data without leaking information into the training phase. It should be your standard approach for validating any predictive model.