You’ve built your first model, and now you’re wondering how to make it better. In this recipe, we’ll outline some useful techniques for beating your baseline.
(Note: metrics have moved to Section 5, where we explore the data and the problem; metrics should be chosen before we actually train a model.)
How to beat your model?
To improve on your baseline results, you have the following options: choose a new model type, reevaluate your data (e.g., feature engineering, transformations), tune your hyperparameters, or, for deep learning models, use transfer learning. The one option that is off the table is training your model on your test data. So how do you know what you should do?
New Model Type
If you’re far from achieving your accuracy metric, the complexity or assumptions built into your model may not match those of the data. In that case, you may have to increase the complexity of your model or look into different model types. For instance, if you built a baseline classification model using decision trees, you could up your game by using random forests or optimal classification trees. However, increasing complexity often means decreasing the interpretability of your model, and you should increase the amount of training data as model complexity grows. Conversely, if your model uses too much memory or compute, you may have to decrease its complexity (likely at the cost of accuracy); for instance, if you built a baseline model with many neural network layers, you will likely want to reduce the number of layers.
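As a minimal sketch of this step, here is how swapping a decision-tree baseline for a random forest might look with scikit-learn; the synthetic dataset and the settings (300 trees, an 80/20 split) are illustrative stand-ins for your own data and budget.

```python
# Sketch: replace a decision-tree baseline with a random forest (synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

print("decision tree:", accuracy_score(y_test, baseline.predict(X_test)))
print("random forest:", accuracy_score(y_test, forest.predict(X_test)))
```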
Another way to increase your model complexity is by using ensemble models. Ensembling combines multiple models (often each with different strengths) to obtain better predictive performance. This can be done by pooling the predictions (via averaging or voting) or by feeding the predictions into a meta-model that learns to predict a better output (stacking). Here are more details on the different methods and here is an example of stacking. Generally, Kaggle competition winners end up using some type of ensemble learning to boost their performance. Note that ensembling comes at the cost of interpretability and ease of deployment.
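For illustration, here is a minimal sketch of stacking with scikit-learn's StackingClassifier; the base models, the logistic-regression meta-model, and the synthetic data are all arbitrary choices for the example.

```python
# Sketch of stacking: out-of-fold predictions from the base models train a meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # the meta-model is trained on cross-validated base-model predictions
)
print("stacked CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```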
Hyperparameter tuning
You may be able to reach your goal just by tuning your hyperparameters. These include decision thresholds, learning rates, the number of training iterations for neural networks, penalty strengths for L1 or L2 regularization, C and sigma for SVMs, k for k-nearest neighbors, and so on. In many Kaggle competitions, hyperparameter tuning has meant the difference between a top-tier model and a mid-level model. This is especially true for neural networks, where the loss landscape has many local minima. Sometimes your problem has a very narrow solution space, and tuning your hyperparameters can help your model find it.
The simplest and most common method for hyperparameter tuning is a grid search. Many libraries can help you implement this, and you can define your search space based on your available time and compute. One good starting point is to look for hyperparameter values used in similar, state-of-the-art models from academia or even Kaggle. For example, you may find that a neural net trained on a task similar to yours uses a learning rate of 0.03. In that case, you can define your hyperparameter search space around that value.
Note that hyperparameter tuning should be done on a separate validation set, not the test set, to avoid overfitting. Ideally you will have enough data to split into training, validation, and test sets; if not, you can also use cross-validation.
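A minimal sketch of both ideas with scikit-learn's GridSearchCV is below: the grid is centered on the hypothetical 0.03 learning rate mentioned above, tuning happens via cross-validation on the training split, and the test set is touched only once at the end.

```python
# Sketch: grid search with cross-validation on the training split only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    "learning_rate": [0.01, 0.03, 0.1],  # search around the reported 0.03
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 4],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)              # cross-validation happens on training data only

print("best params:", search.best_params_)
print("held-out test score:", search.score(X_test, y_test))  # report once at the end
```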
Feedback loop with your training data
If you’re getting inconsistent results, you may need to work more on feature engineering and add (or remove) transformations, normalizations, etc. (see Recipe 4). It always helps to re-evaluate your training data in light of the accuracy metric you chose (assuming the metric is a correct proxy for your question). This feedback loop is an efficient way to check for problems with your data: for example, find the model outputs with the worst loss values and look at their respective inputs. It’s very possible you will find basic data transformations or feature engineering that you missed. Incorporate those changes and retrain your model.
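As a sketch of that loop, the snippet below ranks validation examples by per-example cross-entropy loss so the worst offenders can be inspected by hand; the logistic-regression model and the synthetic data are placeholders for your own.

```python
# Sketch: rank validation examples by per-example loss and inspect the worst ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)

# Per-example cross-entropy loss (clipped to avoid log(0))
p_true = np.clip(proba[np.arange(len(y_val)), y_val], 1e-12, 1.0)
per_example_loss = -np.log(p_true)

worst = np.argsort(per_example_loss)[::-1][:10]  # ten highest-loss validation rows
print(X_val[worst], y_val[worst])                # inspect these for data issues
```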
Additionally, you may find an underrepresented subgroup in your data that yields incorrect results. For example, many early facial recognition algorithms were biased against underrepresented racial groups because their training data consisted primarily of Caucasian faces. In that case, you may want to go back to your sources and try to fetch additional data for that subgroup. The distribution of loss/accuracy values across your training inputs can tell you a lot about your data (and what you should do about it).
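One simple way to surface such subgroups is to break validation accuracy down by a group column; the tiny table below is purely hypothetical and stands in for whatever subgroup labels exist in your own data.

```python
# Sketch: per-subgroup accuracy to spot underrepresented or poorly served slices.
import pandas as pd

results = pd.DataFrame({
    "group":  ["A", "A", "B", "B", "B", "C"],   # hypothetical subgroup labels
    "y_true": [1, 0, 1, 1, 0, 1],
    "y_pred": [1, 0, 0, 1, 1, 0],
})
results["correct"] = results["y_true"] == results["y_pred"]
summary = results.groupby("group")["correct"].agg(count="size", accuracy="mean")
print(summary)  # small or low-accuracy groups are candidates for more data collection
```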
After Deploying: Feedback loop with real-world data
Even if your model has good results on your initial test set, an unchanged model will usually degrade in performance after deployment (for deployment, see Recipe 10). This is known as dataset shift (or distribution shift) and occurs when the training and later test distributions differ.
- Why does this occur? Some of the main reasons for this shift are sample selection bias (the training data was not indicative of the true distribution to begin with) and non-stationary elements (assumptions that change over time or space). For example: you train a model to predict life expectancy using data from healthy people but then deploy it to heavy smokers, or you train a model to predict profits using pre-2007 financial information and deploy it after the 2007-2008 crisis.
- How can you measure the change? A good sanity check is evaluating model performance over time: you can regularly label a sample of the incoming real-world data and compare it to the outputs of your model. However, if you don’t have the luxury of labeling future data, you can compare the distribution of outputs to prior distributions (i.e., a statistical distance). For example, is the histogram of your outputs significantly different from the histogram of prior outputs? A minimal sketch of such a check appears after this list; for more in-depth methods, check out this article.
- How can you incorporate change? Continue to update your training dataset! Similar to the feedback loop with training data, look for underrepresented data samples. If you find data points that your model fails on, seek out more of those data points and update your dataset. For example, check out this talk from Andrej Karpathy, explaining how and why Tesla AI continuously incorporates data from deployed cameras to update its neural networks.
See Recipe 10 for more practical details about automating this feedback loop.
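Here is the output-distribution check from the list above as a minimal sketch; the synthetic reference and recent score arrays stand in for logged model outputs, and the 0.01 significance threshold is an arbitrary choice.

```python
# Sketch of a drift sanity check: compare recent model outputs to a reference window
# with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference_scores = rng.normal(loc=0.40, scale=0.10, size=5000)  # e.g. outputs at launch
recent_scores = rng.normal(loc=0.55, scale=0.10, size=5000)     # e.g. outputs this week

stat, p_value = ks_2samp(reference_scores, recent_scores)
if p_value < 0.01:
    print(f"possible distribution shift (KS statistic={stat:.3f}, p={p_value:.2e})")
```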
Additionally, you should continue feature engineering. Some features will be time-dependent, so as time goes on you may have to update (or even remove) those features based on the insights you get from new model failures. You can also reweight the importance of certain data points. For a time-series example, if you know which data points are closer to the real-world distribution, you can reweight them so they matter more to your model.
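One way to do that reweighting is with the sample_weight argument that many scikit-learn estimators accept; the exponential decay and the 90-day half-life below are illustrative choices, not a rule.

```python
# Sketch: weight recent observations more heavily when fitting a model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=1000, n_features=5, noise=5.0, random_state=0)

age_in_days = np.arange(len(y))[::-1]        # 0 = newest observation (illustrative)
half_life = 90.0
weights = 0.5 ** (age_in_days / half_life)   # newer rows get weights close to 1.0

model = Ridge().fit(X, y, sample_weight=weights)
```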
Recipe for training neural networks
This blog post from Andrej Karpathy, Director of AI at Tesla, is a good recipe for training and optimizing neural networks.
Transfer Learning
Transfer learning is the process of taking a pre-trained model and fine-tuning it on your specific data. For example, if you need to classify cats and dogs, you can start from a model trained on ImageNet (one of the biggest, most famous image benchmarks) rather than from scratch. The main idea behind transfer learning is that general, lower-level information (e.g., edges and simple shapes) is captured in the network’s earlier layers, while the later layers are more task-specific. So we can keep (freeze) those earlier layers and retrain only the final layers to fine-tune the model to our problem.
With transfer learning, you get the benefit of a large, pre-trained, state-of-the-art model from big AI companies like OpenAI or Google that cost millions of dollars to create … for almost nothing! At the very least, a transferred model is a good starting point for initializing your own model (as opposed to random initialization) and, at best, it can solve your problem with less compute and less labeled data. Note that this method is for deep learning models, and your model needs to share the same (or a very similar) architecture with the transferred one. Google shares pre-trained models at TensorFlow Hub, the FastAI library also makes them easily accessible, and HuggingFace has great pre-trained NLP models.
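As a minimal sketch of this idea (using Keras and ImageNet weights as one possible setup), the snippet below freezes a pre-trained ResNet50 feature extractor and trains only a small binary head for a cats-vs-dogs task; loading your own image datasets is left as the commented step.

```python
# Sketch of transfer learning in Keras: freeze a pre-trained backbone, retrain the head.
import tensorflow as tf

base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, pooling="avg", input_shape=(224, 224, 3)
)
base.trainable = False  # keep the pre-trained feature extractor fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary head: cat vs. dog
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # plug in your own tf.data datasets
```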
Useful Resources:
- “A Gentle Introduction to k-fold Cross-Validation” (Machine Learning Mastery)
- “Understanding Dataset Shift” (TDS)
- “Dataset Shift in Machine Learning” book (MIT Press)