Overfitting in ML models

What level of overfitting is acceptable? How can you avoid it, and why should you?

Overfitting happens when a model is more complex than it needs to be and starts to fit the noise in the data, at least in part, instead of the underlying pattern. It leads to poor model performance on new, unseen data.

There is no single answer for what level of overfitting is acceptable; it depends on the specific application, goals, and tradeoffs. A common approach is to use a validation data set to monitor the model’s performance during training. When performance on the validation set starts to degrade while performance on the training set continues to improve (the validation error typically traces a U-shaped curve), the model is moving from the underfitting regime into the overfitting regime. At this point, you can stop training or use techniques such as regularization or data augmentation to limit overfitting. Overfitting hurts generalization beyond the training set and reduces the predictive power of the model, but some level of overfitting is almost always expected and acceptable.
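One way to turn the "stop when validation performance degrades" idea into code is a patience-based check on the validation loss history. The sketch below is a minimal illustration, not a prescribed method: the function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the loss values are all made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran to the last recorded epoch.
    return len(val_losses) - 1, best_epoch


# Hypothetical loss histories: training loss keeps falling, while
# validation loss bottoms out at epoch 4 and then rises (the U shape).
train_losses = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32, 0.27, 0.23, 0.20, 0.18]
val_losses = [0.95, 0.78, 0.66, 0.60, 0.58, 0.59, 0.61, 0.64, 0.68, 0.73]

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch: {best_epoch}, stopped at epoch: {stop_epoch}")
# prints: best epoch: 4, stopped at epoch: 7
```

In practice you would keep the model weights saved at `best_epoch` rather than those from the epoch where training stopped.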

Avoiding overfitting

Several techniques are known to mitigate overfitting.

  • Regularization: A very well-known technique that helps prevent overfitting by adding a penalty term to the model's cost function. Regularization discourages overly complex or extreme parameter values. Subsampling rows/columns in XGBoost or ML models with similar architectures is another form of regularization.

  • Model Simplification: Use a model that is as simple as possible, but not simpler. A simpler model has fewer parameters and less capacity to fit noise. Also discard less discriminative features and keep only the most relevant ones.

  • Knowledge Distillation: Train a smaller, more regularized model to mimic the behavior of a larger, more complex model. This helps in efficient knowledge transfer from the complex model to the simpler one.

  • Cross-Validation: While its primary purpose is not to prevent overfitting, it provides valuable insights into how well a model is likely to perform on unseen data and can indirectly help in preventing overfitting.

  • Early Stopping: Stop model training once its performance on the validation set plateaus or starts to worsen.

  • Data Augmentation: This includes not only collecting more diverse and representative data but also introducing random noise into the input data (to make the model more robust and prevent it from relying too much on specific patterns in the data).
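To make the regularization item above concrete, here is a minimal sketch of an L2 (ridge) penalty added to a least-squares cost, solved in closed form with NumPy. The synthetic sine data, the degree-9 polynomial features, and the helper names `poly_features` and `fit_ridge` are assumptions chosen for illustration. For simplicity the intercept is penalized too, which a production implementation would usually avoid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy samples of a sine wave.
x_train = rng.uniform(0.0, 1.0, size=20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, size=20)


def poly_features(x, degree=9):
    # Columns are x^0, x^1, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)


def fit_ridge(X, y, lam):
    # Minimizes ||Xw - y||^2 + lam * ||w||^2 using the closed-form solution.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


X_train = poly_features(x_train)
results = {}
for lam in (1e-6, 1e-3, 1.0):
    w = fit_ridge(X_train, y_train, lam)
    train_mse = float(np.mean((X_train @ w - y_train) ** 2))
    results[lam] = (float(np.linalg.norm(w)), train_mse)
    print(f"lambda={lam:g}  ||w||={results[lam][0]:.2f}  train MSE={train_mse:.4f}")
```

A larger penalty shrinks the coefficient norm and raises the training error; whether it also improves validation error has to be checked on held-out data.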

As a rule of thumb, I would not be comfortable deploying an ML model whose performance drops from train to test by more than 2-3%. If that is the case, use one or more of the techniques listed above to alleviate the situation.
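The rule of thumb above can be written as a small check. This sketch assumes the drop is measured relative to the training score for a higher-is-better metric such as accuracy or AUC; the helper names and the default 3% threshold are illustrative choices, and an absolute (percentage-point) drop would be an equally reasonable reading.

```python
def generalization_gap(train_score, test_score):
    """Relative drop from train to test for a higher-is-better metric."""
    if train_score <= 0:
        raise ValueError("train_score must be positive")
    return (train_score - test_score) / train_score


def is_gap_acceptable(train_score, test_score, max_relative_drop=0.03):
    # max_relative_drop=0.03 encodes the upper end of the 2-3% rule of thumb.
    return generalization_gap(train_score, test_score) <= max_relative_drop


print(is_gap_acceptable(0.91, 0.89))  # relative drop of about 2.2% -> True
print(is_gap_acceptable(0.98, 0.90))  # relative drop of about 8.2% -> False
```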

Benign overfitting

A closely related phenomenon in machine learning is what is known as "benign overfitting", where a model overfits the training data but still performs well on the test data. Such models have great expressive power. While their performance on test data might be very good, they might be very sensitive to small perturbations in the inputs and thus very fragile in production.

Benign overfitting can be the consequence of an abundance of data or a robust model architecture, to name a few causes. While benign overfitting suggests that the model can handle the noise with no significant degradation in its ability to make accurate predictions on new, unseen data, we still need to ensure models generalize well without relying on spurious patterns in the data.

While benign overfitting does not necessarily harm generalization, it negatively impacts model sensitivity and robustness.

If you need a more practical way to select a model with reasonable performance and robustness, look into PiML (PiML GitHub).

pip install PiML

PiML for identifying model weaknesses before putting models into production

Well-tuned models tend to be more robust in production than overfit and underfit models. That holds true even when an overfit or underfit model has similar performance on the test data to a well-tuned model. So once you've reached a reasonable threshold of model performance, you're better off focusing on model stability, model generalizability, and the policies you'll put in place around the model than on over-optimizing the model itself. It's the combination of the model and the decision strategies around it that will yield meaningful results in lending. So identify model weaknesses before putting models into production.

Let's have a quick look into the impact of benign overfitting on model robustness using PiML.

The interplay between model robustness and the quality of fit

A robust model can effectively manage variations, noise, or outliers in the input data without experiencing a substantial drop in performance. Experimental studies show that the performance of a well-tuned model is statistically significantly less affected by perturbations in input data when compared to that of overfit models.
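A perturbation test of this kind can be sketched in a few lines of NumPy. The sketch below does not use PiML's API; it is a hand-rolled illustration in which the function name `robustness_profile`, the synthetic sine data, the two polynomial models, and the noise levels are all assumptions. It adds Gaussian noise to the inputs and reports the mean squared error at each noise level, so that models with different quality of fit can be compared.

```python
import numpy as np


def robustness_profile(predict, x, y, noise_levels, n_repeats=20, seed=0):
    """Mean squared error of `predict` under Gaussian input noise of each scale."""
    rng = np.random.default_rng(seed)
    profile = {}
    for scale in noise_levels:
        errors = []
        for _ in range(n_repeats):
            # Clip so perturbed inputs stay inside the training range [0, 1].
            x_noisy = np.clip(x + rng.normal(0.0, scale, size=x.shape), 0.0, 1.0)
            errors.append(np.mean((predict(x_noisy) - y) ** 2))
        profile[scale] = float(np.mean(errors))
    return profile


# Synthetic data (an assumption for illustration): noisy samples of a sine wave.
rng = np.random.default_rng(42)
x_train = rng.uniform(0.0, 1.0, size=30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, size=30)
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.3, size=200)

# Two fits of different flexibility on the same training data.
models = {
    "degree 3 (moderate)": np.poly1d(np.polyfit(x_train, y_train, 3)),
    "degree 9 (flexible)": np.poly1d(np.polyfit(x_train, y_train, 9)),
}
noise_levels = [0.0, 0.02, 0.05, 0.1]
for name, model in models.items():
    profile = robustness_profile(model, x_test, y_test, noise_levels)
    print(name, {level: round(mse, 3) for level, mse in profile.items()})
```

The quantity to compare is how quickly each model's error grows as the noise level increases, not only its error at zero noise.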

Interested in reading more on this topic? Continue here.

Avoid overfitting in DNNs

In part 2 of the same series, I'll cover techniques used to avoid overfitting in DNNs.

What's next? Markov Chain Monte Carlo for approximation of the posterior distribution: Metropolis-Hastings vs Hamiltonian Monte Carlo (Read here)