[ML Review] Ridge Regression & Regularization

Introduction

In the previous post, Least Squares Regression (LSR) was reviewed. However, the last example overfits when the frequency becomes higher (the basis function maps the input points into a higher-dimensional feature space). This post reviews a method called Ridge Regression that counters such overfitting, and also introduces the concept of regularization.

Ridge Regression

Before formally introducing Ridge Regression, let’s first think about why overfitting happens in LSR. In our previous example, when the frequency is \( 1 \), the trained function is not “flexible” enough: it can only express one pair of “wave crest” and “wave trough” within the training data’s range. Because of this limitation, the optimization has to compromise, minimizing the error and producing a “trend description” as the final result.

When the frequency increases, the resulting functions become quite flexible. There are enough local “crests” and “troughs” to fit the training points closely, which leads to local “trembling”, or in other words, overfitting. Our goal is now clear: we need to suppress such local “trembling” while keeping the general trend.

In our error function \( E(\text{w}) = \left|\left| t - \text{w}^T \text{X} \right|\right|^2 \), as the dimension of the basis-mapped feature space increases, the only thing that changes is the dimension of the weighting vector \( \text{w} \), and the local “trembling” shows up as large weight components. If we penalize \( \text{w} \)’s magnitude, the result would be better: \( E_{ridge} (\text{w}) = \left|\left| t - \text{w}^T \text{X} \right|\right|^2 + \lambda \left|\left| \text{w} \right|\right|^2 \). Given a suitable \( \lambda \) (a scalar; its selection is discussed later), large weights are penalized and the “trembling” is suppressed.
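
As a minimal sketch of this error function (writing the design matrix so that each row is one basis-mapped data point, matching the implementation later in this post):

import numpy as np

def ridge_error(w, X, t, lam):
    """E_ridge(w) = ||t - X w||^2 + lam * ||w||^2.

    X: num x dim basis-mapped design matrix, t: targets of length num,
    w: weight vector of length dim, lam: the penalty coefficient lambda.
    """
    residual = t - X.dot(w)                      # data-fit term
    return float(np.sum(residual ** 2) + lam * np.sum(w ** 2))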

Not quite intuitive? Assume that \( \text{w}_{opt} \) is the optimal solution of the LSR error function \( E(\text{w}) \). It is generally not the minimum point of the Ridge Regression error function \( E_{ridge} (\text{w}) \), because of the added term \( \lambda \left|\left| \text{w}_{opt} \right|\right|^2 \). To reach the minimum error, the optimization therefore shrinks the components of the weighting vector \( \text{w} \). Let’s look at the comparison of variances in the table below: the variance of the components within each weighting vector decreases as \( \lambda \) grows, which means the local “tremblings” are suppressed.

| frequency | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 | 17 |
|---|---|---|---|---|---|---|---|---|---|
| \( \lambda = 0 \) variance | 0.0342 | 0.0774 | 0.0551 | 0.1171 | 0.0532 | 4.0241 | 7.8691 | 40.2198 | 92.4461 |
| \( \lambda = 2 \) variance | 0.0323 | 0.0610 | 0.0387 | 0.0289 | 0.0227 | 0.0187 | 0.0160 | 0.0139 | 0.0123 |
| \( \lambda = 50 \) variance | 0.0122 | 0.0096 | 0.0060 | 0.0043 | 0.0034 | 0.0028 | 0.0024 | 0.0021 | 0.0018 |

The penalty from \( \lambda \) trades off against the data-fit error. Two limiting cases are easy to see: as \( \lambda \) vanishes the penalty disappears, while as \( \lambda \) grows any non-zero weight becomes too expensive:

$$
\begin{align} & \lambda \rightarrow 0 & \Rightarrow \ \ \ \ & \text{w}_{ridge} \rightarrow \text{w}_{LSR} \\ & \lambda \rightarrow \infty & \Rightarrow \ \ \ \ & \text{w}_{ridge} \rightarrow 0 \end{align}
$$

So everything should be clear now: the optimization of \( E_{ridge} (\text{w}) \) is similar to that of LSR (try the derivation yourself first :D; a sketch follows below). The result is:

$$
\hat{\text{w}} = (\text{X} \text{X}^T + \lambda \text{I})^{-1} \text{X} t
$$
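
For reference, the derivation goes the same way as for LSR (writing the data-fit term with column vectors, so that the columns of \( \text{X} \) are the basis-mapped data points): expand \( E_{ridge} \), take the gradient with respect to \( \text{w} \), and set it to zero.

$$
\begin{align} E_{ridge}(\text{w}) &= (t - \text{X}^T \text{w})^T (t - \text{X}^T \text{w}) + \lambda \text{w}^T \text{w} \\ \nabla_{\text{w}} E_{ridge} &= -2 \text{X} (t - \text{X}^T \text{w}) + 2 \lambda \text{w} = 0 \\ (\text{X} \text{X}^T + \lambda \text{I}) \text{w} &= \text{X} t \end{align}
$$

With \( \lambda = 0 \) this reduces to the LSR normal equation, and for \( \lambda > 0 \) the matrix \( \text{X} \text{X}^T + \lambda \text{I} \) is always invertible, which is a convenient side effect of the regularization.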

Here is a simple Python implementation (the full code is in [1]); the plots of the results were produced with MATLAB:

import numpy as np

# The two methods below belong to the ridge regression class in [1];
# `self.train_data` / `self.train_label` hold the training set,
# `self.lam` holds the penalty coefficient lambda, and
# `self.fourier_basis` maps data points through the Fourier basis.

def do_train(self, freq):
    """
    Train on the input training data.
    Args:
        freq: Frequency for the Fourier basis function: scalar
    Returns:
        Trained weight vector, numpy.ndarray: (2 * freq + 1) x 1
    """

    # Mapping with the basis function.
    ext_pnt = self.fourier_basis(self.train_data, freq)
    penalize = self.lam * np.eye(ext_pnt.shape[1])

    # Calculating and returning w = (X^T X + lambda I)^{-1} X^T t,
    # where each row of ext_pnt is one basis-mapped data point.
    weight = np.dot(np.transpose(ext_pnt), ext_pnt) + penalize
    weight = np.dot(np.linalg.inv(weight), np.transpose(ext_pnt))
    return np.dot(weight, self.train_label)

def do_test(self, weight, freq, test_data, test_label):
    """
    Test on the input testing data.
    Args:
        weight: The trained weight, numpy.ndarray: (2 * freq + 1) x 1
        freq: The frequency for the Fourier basis function: scalar
        test_data: Test data points, numpy.ndarray: num x 1
        test_label: Test data labels, numpy.ndarray: num x 1
    Returns:
        Test error; mean squared error is used: scalar
    """

    # Mapping with the basis function.
    ext_pnt = self.fourier_basis(test_data, freq)

    # Testing and calculating the error.
    test_res = np.dot(ext_pnt, weight)
    return self._mse(test_res, test_label)
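
For a quick end-to-end check, here is a standalone sketch (synthetic data rather than the data set in [2], with the closed-form solution written out directly instead of the class methods above); it prints the variance of the trained weights for a few \( \lambda \) values, mirroring the comparison in the table above:

import numpy as np

def fourier_basis(x, freq):
    """Map points x (num,) to [1, sin(kx), cos(kx)] for k = 1..freq."""
    cols = [np.ones_like(x)]
    for k in range(1, freq + 1):
        cols.append(np.sin(k * x))
        cols.append(np.cos(k * x))
    return np.stack(cols, axis=1)            # num x (2 * freq + 1)

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 30)
t = np.sin(x) + 0.3 * rng.standard_normal(x.shape)    # noisy targets

X = fourier_basis(x, freq=9)
for lam in (0.0, 2.0, 50.0):
    # Closed-form ridge solution: (X^T X + lambda I)^{-1} X^T t.
    w = np.linalg.solve(X.T.dot(X) + lam * np.eye(X.shape[1]), X.T.dot(t))
    print("lambda = %5.1f, variance of weights = %.4f" % (lam, np.var(w)))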

As mentioned above, the result with \( \lambda = 0 \) is identical to the LSR result.

The result with \( \lambda = 2 \) looks much better: the overfitting is well suppressed and the general trend is preserved.

The results with \( \lambda = 50 \) all look similar. Although the overfitting problem is gone, such heavy suppression also costs precision (the model starts to underfit).

Regularization and Selection of \( \lambda \)

Ridge Regression introduces a method to prevent overfitting: it penalizes the magnitude of the weighting vector \( \text{w} \), scaled by a suitable coefficient \( \lambda \). Such a method is called Regularization. As described above, a “suitable” \( \lambda \) is the key to a good result. This chapter gives some background on regularization and on the selection of \( \lambda \).

Regularization

Wikipedia:

Regularization, in mathematics and statistics and particularly in the fields of machine learning and inverse problems, refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.

The most common “introduced information” in Machine Learning is an extra term appended to the error function, composed of a penalty coefficient and the L1-norm or L2-norm of the weighting vector. Our Ridge Regression uses the (squared) L2-norm of the weighting vector.

Besides Ridge Regression, there are other methods that use regularization, for example Lasso Regression, which uses the L1-norm of the weighting vector. (I may cover it in a future post.) The two error functions are compared below.
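
Written side by side (in the same notation as above), the two penalized error functions differ only in which norm of \( \text{w} \) is added:

$$
\begin{align} & E_{ridge} (\text{w}) = \left|\left| t - \text{w}^T \text{X} \right|\right|^2 + \lambda \left|\left| \text{w} \right|\right|_2^2 \\ & E_{lasso} (\text{w}) = \left|\left| t - \text{w}^T \text{X} \right|\right|^2 + \lambda \left|\left| \text{w} \right|\right|_1 \end{align}
$$

The L1 penalty tends to drive individual components of \( \text{w} \) exactly to zero, while the L2 penalty shrinks all components smoothly.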

Selection of \( \lambda \)

Of course, the value of \( \lambda \) could be chosen from empirical knowledge, but this is rarely a good choice (or, to put it bluntly, it’s a bad idea).

A better solution is to try different values of \( \lambda \) on several training and testing sets. The problem is that, in general, good training and testing sets can be very expensive to obtain. Therefore, with a single training/testing set, we can use cross-validation.

Among the different types of cross-validation, the most intuitive one is “leave-one-out” cross-validation. For each candidate \( \lambda \), it takes a single data point as the test set and trains on the remaining points; repeating this once for every point (i.e. \( n \) times for \( n \) data points) and averaging the test errors gives a score for that \( \lambda \), and the value with the lowest average error is chosen.

“Leave-one-out” cross-validation is a kind of exhaustive cross-validation. There are also non-exhaustive variants such as “k-fold” cross-validation, which splits the data into \( k \) equally sized subsets and holds out one subset at a time. A sketch of the “leave-one-out” selection loop is shown below.
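
As a minimal sketch (not the code from [1]): the helper below assumes the design matrix has already been mapped by the basis function, and tries a user-supplied list of candidate \( \lambda \) values.

import numpy as np

def loo_select_lambda(X, t, candidates):
    """Pick lambda by leave-one-out cross-validation.

    X: num x dim basis-mapped design matrix, t: targets of length num,
    candidates: the lambda values to try (e.g. a logarithmic grid).
    """
    t = np.asarray(t).ravel()
    num, dim = X.shape
    best_lam, best_err = None, np.inf
    for lam in candidates:
        errs = []
        for i in range(num):
            mask = np.arange(num) != i                 # leave point i out
            Xtr, ttr = X[mask], t[mask]
            w = np.linalg.solve(Xtr.T.dot(Xtr) + lam * np.eye(dim),
                                Xtr.T.dot(ttr))        # ridge closed form
            errs.append((X[i].dot(w) - t[i]) ** 2)     # held-out squared error
        mean_err = np.mean(errs)
        if mean_err < best_err:
            best_lam, best_err = lam, mean_err
    return best_lam

For the Fourier example above, one could call something like loo_select_lambda(ext_pnt, train_label, [0.01, 0.1, 1, 10, 100]).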

Summary

Compared with LSR, Ridge Regression with a suitable penalty on the weighting vector \( \text{w} \)’s magnitude can dramatically reduce the effect of overfitting. Such a technique is called regularization. Hope this helps :D

[1] Python code
[2] Training & Testing Data
[3] Ridge Regression
[4] Regularization
[5] L1-norm
[6] L2-norm
[7] Lasso Regression
[8] Cross-validation