Introduction
In machine learning, regression is used to predict continuous values, and least squares regression is one of the simplest regression techniques. Given a training set \( \text{X} = \{x_1, …, x_N\} \) with target values \( T = \{t_1, …, t_N\} \), our goal is to obtain a function that “best” describes these points and predicts the target values of other input points. But how can we obtain such a function?
Linear
Firstly, we need to “guess” the form of this function. The simplest is a linear function, \( y = kx + b \), just like what we learnt in middle school. Let’s write it as \( t = w \cdot x + w_0 \), where \(t\) is the target value, \(w\) is the weight and \(w_0\) is the bias. Therefore, to get this function, our goal becomes calculating the values of \(w\) and \(w_0\).
It is obvious that we need some points to calculate the weight and bias. Such points form the training set. In general, a training set contains two parts: the data points \(\text{X} = [x_1, …, x_N] \) and their labels (or target values) \( t = [t_1, …, t_N] \). The formula becomes \( w^T \text{X} + w_0 = t \). In most cases we can write it as \( \widetilde{\text{w}}^T \widetilde{\text{X}} = t \), where \( \widetilde{\text{w}} = [w, w_0]^T \), \( \widetilde{\text{X}} = [\widetilde{x}_1, …, \widetilde{x}_N] \) and \( {\widetilde{x}}_i = [x_i, 1]^T \).
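As a quick illustration of this augmented matrix (a small sketch with made-up points, not the article’s data), \( \widetilde{\text{X}} \) can be built in numpy by stacking a row of ones under the inputs:

```python
import numpy as np

# Made-up 1-D inputs; each column of X_tilde is [x_i, 1]^T.
x = np.array([0.0, 0.5, 1.0, 1.5])
X_tilde = np.vstack([x, np.ones_like(x)])  # shape (2, N)
print(X_tilde)
```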
How can we calculate \( \widetilde{\text{w}} \) now? Let’s consider it from another perspective. If a suitable \( \widetilde{\text{w}} \) were given, we could easily calculate a set of labels \( t^\prime \) for the training set \( \widetilde{\text{X}} \). But in general the components of \( t - t^\prime \) cannot all be zero, because the points almost never fit a line perfectly. (See the figure below.)
In other words, if there exists a \( \widetilde{\text{w}} \) that makes the difference between \( t^\prime \) and \( t \) minimal, then the linear function with this \( \widetilde{\text{w}} \) “best” describes the points. Such a \( \widetilde{\text{w}} \) is what we want. Therefore we can obtain it by setting the derivative of the error function \( E(\widetilde{\text{w}}) = \left|\left| \widetilde{\text{w}}^T \widetilde{\text{X}}- t \right|\right|^2 \) to zero:
$$
\begin{align}
&& \frac{\partial E(\widetilde{\text{w}})}{\partial \widetilde{\text{w}}} & = 2 \widetilde{\text{X}} (\widetilde{\text{X}}^T \widetilde{\text{w}} - t) \stackrel{!}{=} 0 \\
& \Rightarrow & \widetilde{\text{X}} \widetilde{\text{X}}^T \widetilde{\text{w}} & = \widetilde{\text{X}} t \\
& \Rightarrow & \quad \widetilde{\text{w}} & = (\widetilde{\text{X}} \widetilde{\text{X}}^T)^{-1} \widetilde{\text{X}} t
\end{align}
$$
Now let’s implement this in Python (Python 3 + numpy).
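The listing below is a minimal sketch of this closed-form solution, not necessarily the original listing: the helper names (fit_least_squares, predict) and the toy data are illustrative assumptions, and np.linalg.solve is used instead of computing the inverse explicitly, which is numerically more stable.

```python
import numpy as np

def fit_least_squares(x, t):
    """Closed-form fit of t ~ w*x + w0 via the normal equations."""
    # Each column of X_tilde is [x_i, 1]^T, so X_tilde has shape (2, N).
    X_tilde = np.vstack([x, np.ones_like(x)])
    # Solve (X_tilde X_tilde^T) w_tilde = X_tilde t instead of inverting.
    return np.linalg.solve(X_tilde @ X_tilde.T, X_tilde @ t)  # [w, w0]

def predict(w_tilde, x):
    """Evaluate w*x + w0 for new inputs x."""
    return w_tilde @ np.vstack([x, np.ones_like(x)])

# Illustrative noisy data, only to show the usage.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = 2.0 * x_train + 0.5 + 0.1 * rng.standard_normal(x_train.size)

w_tilde = fit_least_squares(x_train, t_train)
print("w, w0 =", w_tilde)
print("predictions:", predict(w_tilde, x_train))
```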
Let’s see its result on a given training set:
Non-linear
Although the obtained line describes the trend of these data points, it’s not what we want. Why? Because it doesn’t fit the points well: our function is linear, but the data points imply a non-linear model! Our “guess” is not suitable for this case. So the solution should be clear: use a non-linear function instead of the purely linear one, e.g. one in polynomial form like \(t = ax^2 + bx + c\) (of course this particular function is not suitable for the above case, it’s just an example :P), or in Fourier form like \(t = \cos{(2 \pi x)} + \sin{(2 \pi x)} + 1\).
There could be many different forms of such non-linear functions, but for convenience we unify them into a linear form: \(t = \text{w}^T \phi{(\text{X})}\). The \(\phi\) here is the so-called basis function; it maps our points (1-D in our case) into a non-linear feature space, for example \(\phi{(x)} = (1, x, x^2)^T\). The weight \(\text{w}\) should now have the form \((w_0, w_1, w_2)^T\). In this case, our function looks like \(t = \text{w}^T \phi{(\text{X})} = w_0 + w_1 x + w_2 x^2\), where \(w_0\) is the bias (so the basis function should provide a constant \(1\) component for the bias, like ours does). The model is now non-linear in \(x\) but still linear in \(\text{w}\), so the same least squares solution applies. Let’s perform it on our data with the basis function:
$$ \phi{(x)} = (\phi_{0} (x), \phi_{1} (x), \phi_{2} (x), …) $$
where
$$
\begin{align}
\phi_{0} (x) & = 1 \\
\phi_{2n - 1} (x) & = \frac{\cos{(2 \pi nx)}}{n} \\
\phi_{2n} (x) & = \frac{\sin{(2 \pi nx)}}{n}
\end{align}
$$
We can implement this basis function and fit it in the same way (Python 3 + numpy).
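Again, the listing below is only a minimal sketch under the same assumptions as before (illustrative names phi, fit_basis, predict_basis; n_max plays the role of \( n \) in the formulas above). Only the design matrix changes; the normal equations stay the same.

```python
import numpy as np

def phi(x, n_max):
    """Fourier design matrix: rows are phi_0, phi_1, ..., phi_{2*n_max}."""
    rows = [np.ones_like(x)]                          # phi_0(x) = 1
    for n in range(1, n_max + 1):
        rows.append(np.cos(2 * np.pi * n * x) / n)    # phi_{2n-1}(x)
        rows.append(np.sin(2 * np.pi * n * x) / n)    # phi_{2n}(x)
    return np.vstack(rows)                            # shape (2*n_max + 1, N)

def fit_basis(x, t, n_max):
    """Least-squares weights w = (Phi Phi^T)^{-1} Phi t."""
    Phi = phi(x, n_max)
    # For large n_max the system can become ill-conditioned;
    # np.linalg.lstsq would be a more robust alternative here.
    return np.linalg.solve(Phi @ Phi.T, Phi @ t)

def predict_basis(w, x, n_max):
    """Evaluate w^T phi(x) for new inputs x."""
    return w @ phi(x, n_max)
```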
Let’s see the results with different \( n \) values (i.e. different frequency values).
At \( n = 1 \), the curve fits the training points reasonably well. When \( n \) increases to \( 3 \) and \( 5 \), the fit looks much better. But beyond \( n = 7 \), things get out of control: the curves begin to “tremble” locally, as if trying to fit every single point in the training set. As humans, we know that the correct result should look like the \( n = 3 \) or \( n = 5 \) cases. That is, if we give the \( n = 17 \) curve some test points, it will produce worse predictions than the \( n = 3 \) or \( n = 5 \) curves, even though it fits the training points better. This situation is called overfitting.
The comparison of training and testing error is shown in the figure below, using the mean squared error (MSE). The training MSE decreases all the time, but the testing MSE first decreases and then increases dramatically.
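Continuing the sketch above (it reuses phi, fit_basis and predict_basis), the snippet below shows how such a train/test MSE comparison could be computed; the synthetic data is purely for illustration and is not the article’s data set.

```python
import numpy as np

# Reuses phi, fit_basis and predict_basis from the sketch above.
rng = np.random.default_rng(1)

def make_data(n_points):
    """Generate noisy samples of a sine wave as illustrative data."""
    x = np.sort(rng.uniform(0.0, 1.0, n_points))
    t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n_points)
    return x, t

x_train, t_train = make_data(20)
x_test, t_test = make_data(20)

def mse(w, x, t, n_max):
    """Mean squared error of the fitted model on (x, t)."""
    return np.mean((predict_basis(w, x, n_max) - t) ** 2)

for n_max in (1, 3, 5, 7, 9):
    w = fit_basis(x_train, t_train, n_max)
    print(n_max, mse(w, x_train, t_train, n_max), mse(w, x_test, t_test, n_max))
```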
Summary
Therefore, in least squares regression, a more complex model (e.g. a larger \( n \)) does not always mean better prediction results. Both the choice of basis function and overfitting are challenges. I will review a solution to overfitting in least squares regression in the next article, and some more general methods will be introduced in the future. Hope they help you :D