Introduction
In the previous post, we reviewed Least Squares Regression (LSR). However, the last example overfits when the frequency becomes higher (i.e., when the basis function maps the input points into a higher-dimensional feature space). This post reviews a method called Ridge Regression that counters such overfitting, and also introduces the concept of regularization.
Ridge Regression
Before formally introducing Ridge Regression, let's first think about why overfitting happens in LSR. In our previous example, when the frequency is \( 1 \), the trained function is not "flexible" enough: it can only express one pair of "wave crest" and "wave trough" within the range of our training data. Because of this limitation, the optimization has to compromise, minimizing the error and producing a "trend description" as the final result.
When the frequency increases, the resulting functions become quite flexible. There are enough local "crests" and "troughs" to fit the training points, which leads to local "trembling", or in other words, overfitting. Our goal is now clear: we need to suppress such local "trembling" while keeping the general trend.
In our error function \( E(\text{w}) = \left|\left| t - \text{w}^T \text{X} \right|\right|^2 \), as the dimension of the feature space produced by the basis function increases, the only thing that changes is the dimension of the weighting vector \( \text{w} \). If we penalize the magnitude of \( \text{w} \), the results will be better: \( E_{ridge} (\text{w}) = \left|\left| t - \text{w}^T \text{X} \right|\right|^2 + \lambda \left|\left| \text{w} \right|\right|^2 \). Given a suitable scalar \( \lambda \) (we will talk about its value selection later), large weights will be penalized and the "trembling" will be suppressed.
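Under the post's conventions (\( \text{X} \) holds the mapped feature vectors as columns, \( t \) the targets), the ridge error can be sketched as follows. The function name `ridge_error` and the variable names are illustrative, not from the original code:

```python
import numpy as np

def ridge_error(w, X, t, lam):
    """Ridge objective: squared data error plus lam * ||w||^2.

    w: (d,) weight vector, X: (d, n) feature matrix (one column
    per sample), t: (n,) targets, lam: penalty coefficient.
    """
    residual = t - w @ X            # predictions are w^T X
    return residual @ residual + lam * (w @ w)

# With lam = 0 this reduces to the plain least-squares error.
```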
Not quite intuitive? Assume that \( \text{w}_{opt} \) is the minimizer of the LSR error function \( E(\text{w}) \). It is generally no longer the minimum point of the Ridge Regression error function \( E_{ridge} (\text{w}) \), because of the added term \( \lambda \left|\left| \text{w}_{opt} \right|\right|^2 \). Therefore, to achieve the minimum error value, the optimization procedure will adjust each component of the weighting vector \( \text{w} \). Let's look at the comparison of variances in the table below. We can observe that the variance of the components within each weighting vector decreases, which means the local "tremblings" are suppressed.
frequency | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 | 17 |
---|---|---|---|---|---|---|---|---|---|
\( \lambda = 0 \) variance | 0.0342 | 0.0774 | 0.0551 | 0.1171 | 0.0532 | 4.0241 | 7.8691 | 40.2198 | 92.4461 |
\( \lambda = 2 \) variance | 0.0323 | 0.0610 | 0.0387 | 0.0289 | 0.0227 | 0.0187 | 0.0160 | 0.0139 | 0.0123 |
\( \lambda = 50 \) variance | 0.0122 | 0.0096 | 0.0060 | 0.0043 | 0.0034 | 0.0028 | 0.0024 | 0.0021 | 0.0018 |
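For clarity, a table entry is simply the variance of the components of one trained weighting vector. A hypothetical illustration (the weights below are made up, not the post's actual results):

```python
import numpy as np

# Example weighting vector for some frequency; the table entry for
# that frequency/lambda pair would be the variance of its components.
w = np.array([0.1, -0.3, 0.25, 0.05])   # illustrative values only
print(np.var(w))                         # variance of the components
```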
In other words, the penalty trades some data-fitting error for smaller weights. The limiting behavior is also easy to see:
$$
\begin{align} & \lambda \rightarrow 0 & \Rightarrow \ \ \ \ & \text{w}_{ridge} \rightarrow \text{w}_{LSR} \\ & \lambda \rightarrow \infty & \Rightarrow \ \ \ \ & \text{w}_{ridge} \rightarrow 0 \end{align}
$$
So everything should be clear now: the optimization of \( E_{ridge} (\text{w}) \) is similar to that of LSR (try the formula derivation yourself :D). The result is:
$$
\hat{\text{w}} = (\text{X} \text{X}^T + \lambda \text{I})^{-1} \text{X} t
$$
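For those who skip the exercise, here is a sketch of the derivation, writing the targets as a column vector \( t \) so that the predictions are \( \text{X}^T \text{w} \):

$$
\begin{align}
E_{ridge}(\text{w}) &= (t - \text{X}^T \text{w})^T (t - \text{X}^T \text{w}) + \lambda \, \text{w}^T \text{w} \\
\nabla_{\text{w}} E_{ridge} &= -2 \, \text{X} (t - \text{X}^T \text{w}) + 2 \lambda \, \text{w} = 0 \\
(\text{X} \text{X}^T + \lambda \text{I}) \, \text{w} &= \text{X} t \\
\hat{\text{w}} &= (\text{X} \text{X}^T + \lambda \text{I})^{-1} \text{X} t
\end{align}
$$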
Here is a simple Python implementation (plots were made with MATLAB); only the signature of the training routine survives in this listing — the complete script is linked at the end of the post:

```python
def do_train(self, freq):
    ...
```
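Since the listing above is truncated, here is a self-contained NumPy sketch of the same idea. The sinusoidal basis and the names `design_matrix`, `ridge_fit`, and `predict` are my own assumptions, not the original code:

```python
import numpy as np

def design_matrix(x, freq):
    """Map inputs to sinusoidal features up to `freq` (assumed basis;
    the post's exact basis function is not shown). Returns (d, n)."""
    rows = [np.ones_like(x)]
    for k in range(1, freq + 1):
        rows.append(np.sin(k * x))
        rows.append(np.cos(k * x))
    return np.vstack(rows)

def ridge_fit(X, t, lam):
    """Closed-form ridge solution: w = (X X^T + lam I)^{-1} X t."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ t)

def predict(w, X):
    """Predictions w^T X for a (d, n) feature matrix."""
    return w @ X

# Usage: fit noisy samples of a sine wave with a moderate penalty.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 50)
t = np.sin(x) + 0.05 * rng.standard_normal(50)
X = design_matrix(x, freq=3)
w = ridge_fit(X, t, lam=2.0)
```

With `lam=0` this reproduces the LSR solution; increasing `lam` shrinks the weights, exactly as the limits above describe.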
As mentioned above, the result with \( \lambda = 0 \) is identical to the LSR result.

The result with \( \lambda = 2 \) looks much better: the overfitting is well suppressed and the general trend is preserved.

All the results with \( \lambda = 50 \) look similar. Although the overfitting is gone, such heavy suppression also costs precision.
Regularization and Selection of \( \lambda \)
Ridge Regression introduces a method to prevent overfitting: it penalizes the magnitude of the weighting vector \( \text{w} \) with a suitable coefficient \( \lambda \). Such a method is called regularization. As described above, a "suitable" \( \lambda \) is the key to getting a correct result. This chapter introduces some background on regularization and on the selection of \( \lambda \).
Regularization
Regularization, in mathematics and statistics and particularly in the fields of machine learning and inverse problems, refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.
The most common "introduced information" in machine learning is an extra term appended to the error function, composed of a penalizing coefficient and the L1-norm or L2-norm of the weighting vector. Our Ridge Regression uses the L2-norm of the weighting vector.
Besides Ridge Regression, there are other methods that utilize regularization. For example, Lasso Regression uses the L1-norm of the weighting vector. (Maybe I will talk about it in a future post.)
Selection of \( \lambda \)
Of course, the value of \( \lambda \) could be chosen from empirical knowledge, but this is rarely a good choice (or, to be blunt, it's a bad idea).
A better solution is to try different values of \( \lambda \) on separate training and testing sets. The problem is that, in general, good training and testing data can be very expensive. Therefore, with a single data set, we can utilize cross-validation.
Among the different types of cross-validation, the most intuitive one is "leave-one-out" cross-validation. For a data set of \( n \) points, each single point in turn serves as the test set while the remaining \( n - 1 \) points form the training set. After running the training/testing \( n \) times (each time with a different test point) for each candidate value, a relatively suitable \( \lambda \) is obtained.
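The procedure above can be sketched as follows. This is a minimal self-contained version; the helper names (`loo_error`), the candidate grid, and the sinusoidal features are assumptions for illustration:

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Closed-form ridge: w = (X X^T + lam I)^{-1} X t, X is (d, n)."""
    return np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), X @ t)

def loo_error(X, t, lam):
    """Leave-one-out CV error: each point serves once as the test set."""
    n = X.shape[1]
    errs = []
    for i in range(n):
        keep = np.arange(n) != i                  # train on n - 1 points
        w = ridge_fit(X[:, keep], t[keep], lam)
        errs.append((t[i] - w @ X[:, i]) ** 2)    # test on the held-out one
    return np.mean(errs)

# Try a few candidate values and keep the one with the lowest CV error.
rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 30)
t = np.sin(x) + 0.1 * rng.standard_normal(30)
X = np.vstack([np.ones_like(x)] + [f(k * x) for k in range(1, 8)
                                   for f in (np.sin, np.cos)])
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: loo_error(X, t, lam))
```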
"Leave-one-out" cross-validation is a kind of exhaustive cross-validation. There are also non-exhaustive variants, such as "k-fold" cross-validation.
Summary
Compared with LSR, Ridge Regression with a suitable penalty on the magnitude of the weighting vector \( \text{w} \) can dramatically reduce the effect of overfitting. Such a technique is called regularization. Hope this helps you :D
[1] Python code
[2] Training & Testing Data
[3] Ridge Regression
[4] Regularization
[5] L1-norm
[6] L2-norm
[7] Lasso Regression
[8] Cross-validation