
# Regularization in Machine Learning: Reducing Errors and Biases in ML Models

Eugene Dorfman
7 May 2022
Your robotic vacuum cleaner uses it, and so does your iPhone. Every Google query you make is processed by it. Machine learning is everywhere in 2022, and it’s often inaccurate.

One of the most infamous examples of inaccuracy in machine learning is COMPAS: the algorithm behind it was flawed and prone to racial bias, which is only one of the reasons it was called “no more accurate than the average person’s guess.” A more common kind of ML misbehavior occurs when an algorithm fails to recognize an object because it isn’t in the position the algorithm is “used” to seeing. In a study by Adobe, a neural network recognized a school bus in its usual position, but once the bus was positioned diagonally in the photo, the network classified it as a punching bag.

In this article, we will talk about what regularization is in data science and machine learning and look at techniques that help us reduce errors in predictive algorithms (and not just them).


## How predictions are made and how errors occur in machine learning

But first, let’s figure out what models data scientists and machine learning engineers use to predict things and what prevents them from doing it accurately.

### In linear regression

To work with linear data, data scientists use linear regression. It helps to predict things: CV acceptance rates, the probability of a patient being admitted to the hospital, revenue growth, income, etc. Here’s our regression:
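In a common textbook notation (the symbols $\beta_j$ and $X_j$ are an assumption here, since the article does not name them), a linear regression with $p$ features can be written as:

```latex
Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p
```

In this form, $\beta_0$ plays the role of the bias and $\beta_1, \dots, \beta_p$ are the weights.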

Here, Y is the learned relation — a value the model needs to predict.

To fit a model that accurately predicts the value of Y, we require a loss function (that will measure the inaccuracy of the model) and optimized parameters (bias and weights). We’ll use a residual sum of squares to figure out this value:
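Assuming $n$ training examples with targets $y_i$ and features $x_{ij}$, a bias $\beta_0$ and weights $\beta_j$ (notation chosen for illustration), the residual sum of squares is usually written as:

```latex
RSS = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
```

Fitting the model means choosing the bias and weights that make this quantity as small as possible on the training set.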

Now, this model optimizes the coefficients (the weights) based on the training dataset. If the real-world data we feed it is noisy, it won’t predict well: it can’t make sense of the noise, yet it keeps fitting the noise as if it carried meaning. “Noisy” means “containing irrelevant, non-representative data.” For example, suppose you’re trying to forecast cost per customer acquisition, but the model is constantly thrown off by the New Year sales data in your annual report, because the training set didn’t contain these holiday-related fluctuations (for instance, because extreme events had been cleaned out of the database).
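As a minimal sketch of the fitting step described above, the snippet below fits a linear regression by least squares with NumPy and computes its RSS. The synthetic data, the "true" bias and weights, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two features, a known bias and weights, plus Gaussian noise.
X = rng.normal(size=(100, 2))
true_bias, true_weights = 1.0, np.array([2.0, -3.0])
y = true_bias + X @ true_weights + rng.normal(scale=0.5, size=100)

# Prepend a column of ones so the first coefficient acts as the bias.
X1 = np.column_stack([np.ones(len(X)), X])

# Least squares picks the coefficients that minimise the RSS on this data.
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
rss = np.sum((y - X1 @ coef) ** 2)

print("bias and weights:", coef)
print("RSS on the training data:", rss)
```

The fitted coefficients land close to the values used to generate the data because the noise here is small and well behaved; the noisier or less representative the training data, the further they drift.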

### In polynomial regression

Now, if we have non-linear data, or want more accurate predictions on large sets of linear data, we use polynomial regression. Such models are classified by their order, that is, the highest power of the input variable. Here is what polynomial functions look like:

• a simple linear regression equation
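Written out for a single input $x$, and assuming coefficients named $\beta_k$ (symbols chosen for illustration), polynomial models of increasing order typically take this form:

```latex
\begin{aligned}
\text{first order (linear):} \quad & y = \beta_0 + \beta_1 x \\
\text{second order:} \quad & y = \beta_0 + \beta_1 x + \beta_2 x^2 \\
\text{order } K\text{:} \quad & y = \beta_0 + \beta_1 x + \dots + \beta_K x^K
\end{aligned}
```

The model stays linear in its coefficients, so it can still be fitted by minimizing a residual sum of squares.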

The following figure shows polynomial regression applied to the men’s 100-meter freestyle problem from the Olympic Games: predicting the winning time on the basis of data from previous competitions.

The fitted curves show that the training loss decreases as the model order increases, while the test loss increases. As the model becomes more complex, the degree of overfitting grows as well.

Overfitting is what we described in the cost per acquisition example and what we see in the eighth-order polynomial model above: the model captures and reacts to absolutely every detail in the training dataset. Then, when it can’t find the same patterns and behaviors in real-world data, it goes off the rails.
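The effect can be reproduced with a small experiment. The sketch below is not the Olympic dataset from the figure: it uses hypothetical synthetic data (a noisy sine curve) and fits polynomials of order 1, 3 and 8 with `np.polyfit`, then compares training and test loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical data: a sine curve with Gaussian noise."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

x_train, y_train = make_data(12)    # small training set
x_test, y_test = make_data(200)     # unseen data from the same process

orders = [1, 3, 8]
train_loss, test_loss = [], []
for order in orders:
    coeffs = np.polyfit(x_train, y_train, deg=order)
    train_loss.append(mse(coeffs, x_train, y_train))
    test_loss.append(mse(coeffs, x_test, y_test))
    print(f"order {order}: train {train_loss[-1]:.4f}, test {test_loss[-1]:.4f}")
```

The training loss can only go down as the order grows, because each higher-order model contains the lower-order ones. The test loss has no such guarantee, and with few training points a high-order fit typically does worse on unseen data.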

There’s also underfitting: the model assumes too much, and its overgeneralization produces biased results. It can happen because of a lack of data, because the algorithm neglects relevant variables, or because the model isn’t complex enough to capture the underlying pattern.

## How does regularization work in machine learning

What is regularization in machine learning? Regularization was introduced to add constraints (extra rules) to the training process so that the model generalizes better and becomes more accurate on unseen data.

Regularization makes algorithms less prone to errors by steering them toward a fit that holds up beyond the given dataset. It often means keeping the same number of variables but reducing the magnitude of large coefficients. Remember how our linear regression model hinges on its coefficients?

Regularization is basically what penalizes the model for giving too much weight or explanatory power to the features it’s using, or for taking too many features into account. For overfitting, regularization techniques say, “this is too specific; we need to be able to draw a broader picture.”

If the penalty is too strong, however, the model swings toward underfitting, so the strength of regularization has to be tuned to keep the model away from both extremes.

There are four main regularization techniques:

• L1;
• L2;
• Dropout;
• Early Stopping.

Now, let’s talk about what L1 and L2 regularization are in machine learning.

### Regularization via lasso regression (L1 Norm)

Let’s return to our linear regression model and apply the L1 Regularization technique. Lasso regression helps us automate certain parts of model selection like variable selection — it will stop the model from overanalyzing everything it sees.

L1 regularization modifies our RSS by adding a shrinkage quantity (a penalty) proportional to the sum of the absolute values of the coefficients. The lasso loss function deals with absolute values of coefficients only:
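A standard way to write this loss, assuming coefficients named $\beta_j$ and a tuning parameter $\alpha$ that sets the strength of the penalty, is:

```latex
L_{\text{lasso}} = RSS + \alpha \sum_{j=1}^{p} |\beta_j|
```

By convention the bias term is usually left out of the penalty, which is why the sum starts at $j = 1$.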

The shrinkage quantity (the lasso constraint) limits the absolute values of the coefficients and penalizes them if they get too high. The penalty term is the L1 norm of the coefficient vector, which is why the technique is called L1 regularization.

#### When to use lasso regression

Lasso regression is most commonly used when you have many features in a dataset and want to work with only a few selected ones. With L1, the algorithm focuses on those only: for the unnecessary features, the lasso coefficients are reduced to zero. That makes it a way to reduce the number of dimensions in the dataset. It works well when the data already has features that stand out and you want to single them out: it will work nicely with our cost per acquisition example.
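The zeroing-out behavior can be seen in a small NumPy sketch. It minimizes RSS plus `alpha` times the sum of absolute coefficients using proximal gradient descent (ISTA), which is one of several possible solvers and not necessarily what any given library uses. The data, the value of `alpha`, and the function names are hypothetical, and the intercept is omitted for brevity.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink every entry of z toward zero by t; entries within t of zero become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_fit(X, y, alpha, n_iter=1000):
    """Minimise RSS + alpha * sum(|beta_j|) by proximal gradient descent (ISTA)."""
    beta = np.zeros(X.shape[1])
    # Step size = 1 / Lipschitz constant of the RSS gradient (2 * sigma_max(X)^2).
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)          # gradient of the RSS term
        beta = soft_threshold(beta - step * grad, step * alpha)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_lasso = lasso_fit(X, y, alpha=40.0)

print("least squares:", np.round(beta_ols, 3))
print("lasso:        ", np.round(beta_lasso, 3))
```

Plain least squares assigns small non-zero weights to the noise features, while the lasso solution sets them to exactly zero and slightly shrinks the two useful coefficients.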

### Regularization via ridge regression (L2 Norm)

This regularization technique, on the other hand, penalizes the square of the coefficients’ magnitude instead of their absolute values. The loss function looks like this:
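In a standard formulation, with coefficients named $\beta_j$ as an assumed notation, the ridge loss is:

```latex
L_{\text{ridge}} = RSS + \alpha \sum_{j=1}^{p} \beta_j^2
```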

Parameter α is a tuning parameter here: by adjusting it, we decide how heavily to penalize the model, trading off a small sum of squared coefficients against a small RSS. Ridge regression shrinks coefficient estimates toward zero but does not set them exactly to zero, so no variable is removed from the model or ignored.

#### When to use ridge regression

L2 regularization works well when your dataset has multicollinearity, that is, when several variables are correlated. L1 regression tends to keep one predictor from a group of correlated ones and discard the rest, which poses an issue if you’re analyzing large datasets where every variable is significant. Ridge regression makes a model less complex; it’s good for making long-term predictions and forecasting general trends.
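A minimal sketch of ridge regression on correlated features, using the closed-form solution in NumPy. The two nearly identical features, the values of `alpha`, and the helper name `ridge_fit` are assumptions for illustration, and the intercept is omitted for brevity.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form minimiser of RSS + alpha * sum(beta_j ** 2), without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)    # nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)

betas, norms = {}, {}
for alpha in [0.0, 1.0, 100.0]:
    betas[alpha] = ridge_fit(X, y, alpha)
    norms[alpha] = np.linalg.norm(betas[alpha])
    print(f"alpha={alpha:>5}: beta={np.round(betas[alpha], 3)}, norm={norms[alpha]:.3f}")
```

As `alpha` grows, the overall size of the coefficient vector shrinks, yet both coefficients stay non-zero: ridge keeps every variable in the model rather than selecting among them.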

### Regularization via dropout

Dropout is the third most popular technique for reducing overfitting, and it is used exclusively in neural networks. Using it means ignoring some nodes in a given layer during training: it turns some neurons in the network off so that they a) aren’t learning anything and b) aren’t contributing anything to the network’s output. It helps the network generalize by learning more robust internal representations. The nodes to drop are chosen at random.

The analogy here would be that a network can avoid overfitting if fewer nodes are trying to figure out what’s happening at once: because any node can disappear on a given training step, dropout keeps the model from “listening” too closely to a single node in the layer and pushes it to learn general patterns.
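A minimal NumPy sketch of the idea, assuming the common “inverted dropout” variant, in which the surviving activations are rescaled by 1 / (1 - rate) so that their expected value stays the same. The function name, the `rate` parameter, and the example layer output are hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of activations during training (inverted dropout)."""
    if not training or rate == 0.0:
        return activations                      # at inference time nothing is dropped
    keep_mask = rng.random(activations.shape) >= rate
    # Rescale the survivors so the expected activation is unchanged.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
layer_output = np.ones((4, 5))                  # stand-in for one layer's activations

dropped = dropout(layer_output, rate=0.5, rng=rng)
unchanged = dropout(layer_output, rate=0.5, rng=rng, training=False)

print(dropped)    # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

A fresh random mask is drawn on every call, so a different subset of nodes is silenced at each training step.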

#### When to use dropout for regularization

A dropout technique is a tool that’s used for regularization in neural networks (particularly if you don’t have enough training data for them.)

### Regularization via early stopping

The main principle of early stopping is to, yes, stop when the model starts to overfit. It’s done with training and validation sets of data: the model is trained on the training set while its performance on the validation set is monitored, and training stops once validation accuracy starts to decrease. The number of training iterations can therefore be considered a hyperparameter that the data scientist needs to tune.
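The loop below is a minimal sketch of this procedure for a model trained by gradient descent. It assumes a `patience` setting (the number of epochs to wait without improvement before stopping), which is a common convention rather than something the article specifies; the data, learning rate and function names are hypothetical as well.

```python
import numpy as np

def fit_with_early_stopping(X_train, y_train, X_val, y_val,
                            lr=0.05, max_epochs=5000, patience=20):
    """Gradient descent on mean squared error; stops once the validation loss
    has not improved for `patience` consecutive epochs."""
    w = np.zeros(X_train.shape[1])
    best_w, best_val, best_epoch = w.copy(), np.inf, -1
    history = []
    for epoch in range(max_epochs):
        grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
        w -= lr * grad
        val_loss = np.mean((X_val @ w - y_val) ** 2)
        history.append(val_loss)
        if val_loss < best_val:
            # New best validation loss: remember these weights.
            best_w, best_val, best_epoch = w.copy(), val_loss, epoch
        elif epoch - best_epoch >= patience:
            break                               # no improvement for `patience` epochs
    return best_w, best_epoch, history

rng = np.random.default_rng(0)

def make_split(n):
    """Hypothetical noisy data with polynomial features x^0 ... x^9."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return np.vander(x, 10, increasing=True), y

X_tr, y_tr = make_split(15)
X_va, y_va = make_split(15)

best_w, best_epoch, history = fit_with_early_stopping(X_tr, y_tr, X_va, y_va)
print(f"ran {len(history)} epochs, best validation loss at epoch {best_epoch}")
```

The function returns the weights from the best validation epoch, not the last one, so the epochs spent waiting out the patience window do not affect the final model.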

#### When to use early stopping

This technique prevents overfitting and can help discover the number of iterations that can be run before issues occur. It is a good fit for deep learning.

## Regularization improves machine learning models’ performance

Regularization improves how well your model generalizes and makes it more accurate on new data. Considering how much depends on the accuracy of ML algorithms, they need to perform well or at least provide clear, actionable conclusions.

Overfitting and underfitting can become a pain not only for predictions, though. Face recognition tools that unlock our screens, wearables that track our vitals (and can call a doctor if something is wrong), voice recognition tech: all of these use machine learning, and all of these are prone to too little or too much generalization.