[ML Review] Regression from a Probabilistic Perspective

Introduction

I have previously reviewed Least Squares Regression (LSR) and Ridge Regression (RR). In this post, I will review them again, but from a probabilistic perspective.

Probabilistic Interpretation: LSR

In LSR and RR, we optimized the result by minimizing the error (the sum-of-squares error was used). We found that the training error cannot be 0, so the fit is never perfectly precise. This is because, in general, data sets contain noise. Therefore, if we want to view LSR and RR from a probabilistic perspective, we can use probability to express that noise. For example, given a training set, let's assume the noise of the points in this set follows a Gaussian distribution.

Our target value \( t \) is now:

$$
t = \widetilde{\text{w}}^T \widetilde{\text{x}} + \epsilon = y(\text{x}, \text{w}) + \epsilon
$$

\( \epsilon \) here is the noise, which follows a Gaussian distribution, and \( y(\text{x}, \text{w}) \) is the target curve (ideally, the curve without error). Transforming this into a probabilistic form, for every single point \( \text{x} \) it looks like:

$$
p(t | \text{x}, \text{w}, \beta) = \mathcal{N} (t | y(\text{x}, \text{w}), \beta^{-1})
$$

This means that given a point \( \text{x} \), the corresponding weighting vector \( \text{w} \) and the precision parameter \( \beta \) (more on this later), the target value \( t \) follows a Gaussian distribution. Therefore, to learn the weighting vector \( \text{w} \) and \( \beta \), we can simply maximize the conditional likelihood over the training set \( \text{X} \) and its label set \( \text{t} \):

$$
p(\text{t} | \text{X}, \text{w}, \beta) = \prod_{n = 1}^{N} {\mathcal{N} (t_n | y(\text{x}_n, \text{w}), \beta^{-1})}
$$
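
To make this concrete, here is a minimal NumPy sketch (the function and variable names are my own, not from the post) that evaluates this product of Gaussian densities for a given \( \text{w} \) and \( \beta \); with many points the raw product underflows quickly, which is one more reason to work in \( \log \) space, as we do next.

```python
import numpy as np

def gaussian_likelihood(Phi, t, w, beta):
    """Product over n of N(t_n | w^T phi(x_n), 1/beta).

    Phi  : (N, M) design matrix whose n-th row is phi(x_n)^T
    t    : (N,) target values
    w    : (M,) weighting vector
    beta : noise precision (1 / variance)
    """
    mean = Phi @ w                     # y(x_n, w) for every point
    var = 1.0 / beta                   # noise variance
    densities = np.exp(-(t - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(densities)          # product over the N points
```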

To simplify the calculation, we write it in \( \log \) form:

$$
\begin{align} \log{p(\text{t} | \text{X}, \text{w}, \beta)} & = \sum_{n = 1}^{N} {\log{\mathcal{N} (t_n | y(\text{x}_n, \text{w}), \beta^{-1})}} \\ & = \sum_{n = 1}^{N} { \left[\log{\left( \sqrt{\frac{\beta} {2 \pi}} \right)} - \frac{\beta} {2} \{ y(\text{x}_n, \text{w}) - t_n \}^2 \right] } \\ & = \frac{N} {2} \log{\beta} - \frac{N} {2} \log{(2 \pi)} - \frac{\beta} {2} \sum_{n = 1}^{N} \{ t_n - y(\text{x}_n, \text{w}) \}^2 \end{align}
$$

where \( \frac{N} {2} \log{\beta} - \frac{N} {2} \log{(2 \pi)} \) is constant with respect to \( \text{w} \), and the remaining term is the (scaled and negated) sum-of-squares error, so maximizing the log-likelihood w.r.t. \( \text{w} \) is equivalent to minimizing the sum-of-squares error. Calculating its gradient w.r.t. \( \text{w} \), and writing \( \phi(\text{X}) \) for the design matrix whose \( n \)-th row is \( \phi(\text{x}_n)^T \), we get:

$$
\begin{align} \nabla_{\text{w}} \log {p(\text{t} | \text{X}, \text{w}, \beta)} & = \beta \sum_{n = 1}^{N} {(t_n - \text{w}^T \phi(\text{x}_n)) \phi(\text{x}_n)} \stackrel{!}{=} 0 \\ & \Rightarrow \sum_{n = 1}^{N} {t_n \phi(\text{x}_n)} = \left[ \sum_{n = 1}^{N} {\phi(\text{x}_n) \phi(\text{x}_n)^T} \right] \text{w} \\ & \Rightarrow \phi(\text{X})^T \text{t} = \phi(\text{X})^T \phi(\text{X}) \text{w} \\ & \Rightarrow \hat{\text{w}} = (\phi(\text{X})^T \phi(\text{X}))^{-1} \phi(\text{X})^T \text{t} \end{align}
$$

Does it look familiar? It's the same as LSR's optimization result! We should now have a better understanding of what Least Squares Regression really is: LSR is equivalent to Maximum Likelihood Estimation under the assumption of Gaussian noise. Now let's take a look at the previously mentioned \( \beta \). If we calculate the gradient w.r.t. \( \beta \), we get the following optimization result (do it yourself :D):

$$
\frac{1} {\hat{\beta}} = \frac{1} {N} \sum_{n = 1}^{N} {[t_n - \hat{\text{w}}^T \phi(\text{x}_{n}) ]^2}
$$

In other words, the estimated noise variance \( \hat{\beta}^{-1} \) is exactly the mean squared error of the fit: the higher the precision \( \hat{\beta} \), the lower the mean squared error on the data. After obtaining these two values, we can predict the target for a new input with the following distribution:

$$
p(t | \text{x}, \hat{\text{w}}, \hat{\beta}) = \mathcal{N} \left( t | y(\text{x}, \hat{\text{w}}), \hat{\beta}^{-1} \right)
$$
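
Putting the pieces together, here is a hedged NumPy sketch of the whole maximum-likelihood recipe on a toy polynomial-basis example (the toy data, names, and basis choice are my own assumptions, not from the post): fit \( \hat{\text{w}} \) via the normal equations, estimate \( \hat{\beta} \) from the residuals, and read off the predictive mean and variance for a new input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of a sine curve (an assumption for illustration).
N, M = 30, 4                                   # number of points, basis size
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def design_matrix(x, M):
    """Polynomial features: n-th row is phi(x_n)^T = [1, x_n, ..., x_n^(M-1)]."""
    return np.vander(x, M, increasing=True)

Phi = design_matrix(x, M)

# Maximum-likelihood weights: w_hat = (Phi^T Phi)^{-1} Phi^T t
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Sanity check: identical to the ordinary least-squares solution.
w_lsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
assert np.allclose(w_hat, w_lsq)

# Noise precision: 1 / beta_hat is the mean squared residual.
beta_hat = 1.0 / np.mean((t - Phi @ w_hat) ** 2)

# Predictive distribution for a new input x_new:
# p(t | x_new) = N(t | w_hat^T phi(x_new), 1 / beta_hat)
x_new = 0.3
phi_new = design_matrix(np.array([x_new]), M)[0]
pred_mean = phi_new @ w_hat
pred_var = 1.0 / beta_hat
print(f"predictive mean {pred_mean:.3f}, predictive variance {pred_var:.3f}")
```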

So far, these derivations are still based on the frequentist approach; let's now stride towards the Bayesian one.

Probabilistic Interpretation: RR

Because LSR is equivalent to Maximum Likelihood Estimation under the assumption of Gaussian noise, they also share the same challenge: overfitting, which I have illustrated in this post. To avoid overfitting, Ridge Regression introduces the regularization coefficient \( \lambda \) to penalize weighting vectors with large norms. One way to choose such a \( \lambda \) is to take advantage of empirical knowledge. So what corresponds to “empirical knowledge” in the probabilistic setting? The prior! Since this still sounds quite abstract, let me illustrate it with formulas.

First, let me restate our question: given a training set \( \text{X} \) and its labels \( \text{t} \), our goal is to learn a single weighting vector \( \text{w} \) which “best” describes the data points. To avoid overfitting, we can provide some “guidance” to our model. Such “guidance” is a prior, which describes the prior distribution of the weighting vector \( \text{w} \). ATTENTION: it is a distribution over \( \text{w} \), rather than a single value of it.

Since it is a distribution, there must be some parameters describing it. We call such parameters hyperparameters. For example, assume the prior over \( \text{w} \) is an isotropic Gaussian with hyperparameters \( \alpha = \left( m_{0}, S_{0} \right) \), where \( m_{0} \) is the mean vector and \( S_{0} \text{I} \) is the covariance matrix. The distribution then looks like this:

$$
p(\text{w} | \alpha) = \mathcal{N} \left(\text{w} | m_{0}, S_{0} \text{I} \right) = \left( \frac{1} {2 \pi S_{0}} \right)^{(M + 1) / 2} \exp \left(- \frac{1} {2 S_{0}} (\text{w} - m_{0})^T (\text{w} - m_{0}) \right)
$$
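
To stress that the prior really is a distribution over whole weighting vectors, here is a small sketch (hypothetical names; an isotropic Gaussian prior with \( m_{0} = 0 \) is assumed) that draws a few candidate vectors \( \text{w} \) from it; each sample defines a different curve \( y(\text{x}, \text{w}) \) before any data has been seen.

```python
import numpy as np

M = 4                      # dimensionality of w (assumption for illustration)
m0 = np.zeros(M)           # prior mean
S0 = 0.5                   # prior (isotropic) variance, covariance = S0 * I

rng = np.random.default_rng(1)

# Each draw is a complete weighting vector; the prior is a distribution
# over such vectors, not over a single scalar value.
w_samples = rng.multivariate_normal(m0, S0 * np.eye(M), size=5)

x_grid = np.linspace(0.0, 1.0, 5)
Phi_grid = np.vander(x_grid, M, increasing=True)
for w in w_samples:
    print(np.round(Phi_grid @ w, 2))   # one prior-induced curve per sample
```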

According to Bayes' Theorem, the posterior distribution over \( \text{w} \) is proportional to the likelihood times the prior:

$$
p(\text{w} | \text{X}, \text{t}, \beta, \alpha) \propto p(\text{t} | \text{X}, \text{w}, \beta) p(\text{w} | \alpha)
$$

Minimizing its negative \( \log \) form:

$$
\begin{align} - \log {p(\text{w} | \text{X}, \text{t}, \beta, \alpha)} & \propto - \log{p(\text{t} | \text{X}, \text{w}, \beta)} - \log{p(\text{w} | \alpha)} \\ & \propto - \log{ \prod_{n = 1}^{N} {\mathcal{N} (t_n | y(\text{x}_n, \text{w}), \beta^{-1})} } - \log{ \exp \left(- \frac{1} {2 S_{0}} (\text{w} - m_{0})^T (\text{w} - m_{0}) \right) } \\ & \propto \frac{\beta}{2} \sum_{n = 1}^{N} \{ y(\text{x}_n, \text{w}) - t_n\}^2 + \frac{1} {2 S_0} (\text{w} - m_{0})^T (\text{w} - m_{0}) + \text{const} \\ & \propto \frac{\beta}{2} \left\| \phi(\text{X}) \text{w} - \text{t} \right\|^2 + \frac{1} {2 S_0} (\text{w} - m_{0})^T (\text{w} - m_{0}) + \text{const} \end{align}
$$

Setting the gradient to zero, we get the optimized weighting vector \( \hat{\text{w}} \):

$$
\begin{align} \nabla_{\text{w}} \left[ - \log {p(\text{w} | \text{X}, \text{t}, \beta, \alpha)} \right] & = - \beta \sum_{n = 1}^{N} {(t_n - \text{w}^T \phi(\text{x}_n)) \phi(\text{x}_n)} + {S_{0}}^{-1} (\text{w} - m_{0}) \stackrel{!}{=} 0 \\ & \Rightarrow \beta \cdot \phi(\text{X})^T \text{t} + {S_{0}}^{-1} m_{0} = \left( \beta \cdot \phi(\text{X})^T \phi(\text{X}) + {S_{0}}^{-1} \text{I} \right) \text{w} \\ & \Rightarrow \hat{\text{w}} = \left( \beta \cdot \phi(\text{X})^T \phi(\text{X}) + {S_{0}}^{-1} \text{I} \right)^{-1} \left( \beta \cdot \phi(\text{X})^T \text{t} + {S_{0}}^{-1} m_{0} \right) \end{align}
$$

The result looks quite complicated, but for simplicity we often set the prior mean \( m_{0} = 0 \). Then it reduces to the RR optimization result, and the regularization parameter \( \lambda \) in Ridge Regression equals \( \frac{1} {S_{0} \beta} \). This is the so-called Maximum a Posteriori (MAP) estimation.
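
As a numerical sanity check on this equivalence, the sketch below (toy data and names are my own assumptions) compares the closed-form MAP solution with \( m_{0} = 0 \) against the Ridge Regression solution with \( \lambda = \frac{1}{S_{0} \beta} \), using the same design-matrix convention as above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup (assumptions for illustration).
N, M = 30, 6
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)
Phi = np.vander(x, M, increasing=True)

beta = 25.0          # noise precision (assumed known here)
S0 = 0.5             # prior variance, with prior mean m0 = 0

# MAP estimate: (beta * Phi^T Phi + (1/S0) I)^{-1} * beta * Phi^T t
w_map = np.linalg.solve(beta * Phi.T @ Phi + np.eye(M) / S0,
                        beta * Phi.T @ t)

# Ridge estimate with lambda = 1 / (S0 * beta): (Phi^T Phi + lam I)^{-1} Phi^T t
lam = 1.0 / (S0 * beta)
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

assert np.allclose(w_map, w_ridge)   # MAP with a Gaussian prior == Ridge
```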

Summary

So far, we have gained a better understanding of both ordinary linear regression (Least Squares Regression) and regularized regression (Ridge Regression). Least Squares Regression is equivalent to Maximum Likelihood Estimation under the Gaussian noise assumption, and they share the same problem: overfitting. Ridge Regression is equivalent to Maximum a Posteriori estimation with a Gaussian prior distribution over the weighting vector \( \text{w} \). The probabilistic interpretation of RR is already close to Bayesian estimation. In the next post, I will try to introduce Bayesian Curve Fitting. Hope it helps :D

Reference

[1] Least Squares Regression
[2] Ridge Regression
[3] Maximum Likelihood Estimation
[4] frequentist approach
[5] hyperparameter
[6] Maximum a posteriori (MAP) estimation
[7] Bayes’ Theorem
[8] Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.