[ML Review] Bayesian Curve Regression

Introduction

Building on the previous post, we will look deeper into regression from a probabilistic perspective.

In LSR and RR (least-squares regression and ridge regression), the training result is a single value of the weighting vector \( \text{w} \), with which we can predict new input points simply by applying the basis function and a dot product. RR already works quite well, but choosing the value of the regularization coefficient is rather intractable. Bayesian curve fitting can help us deal with this challenge.

Linear Regression in Bayesian Approach

Training

The following is the training phase of Bayesian curve fitting. We are given a training set \( \text{X} = \{ \text{x}_{1}, …, \text{x}_{N} \} \), the corresponding labels \( \text{t} = \{ t_{1}, …, t_{N} \} \) and the hyperparameters of the (Gaussian) prior distribution \( \alpha = (m_0, S_0) \), and we choose a suitable basis function \( \phi \). Our goal is the posterior distribution of the weighting vector (a code sketch of the whole procedure follows this list):

  1. Calculating the optimized weighting vector \( \text{w}_{opt} \) with LSR:

    $$
    \text{w}_{opt} = \left( \phi(\text{X}) \phi(\text{X})^T \right )^{-1} \phi(\text{X}) \text{t}
    $$

  2. Calculating the precision \( \beta \):

    $$
    \frac{1} {\beta} = \frac{1} {N} \sum_{n = 1}^{N} {[t_n - \text{w}_{opt}^T \phi(\text{x}_{n}) ]^2}
    $$

  3. Calculating the MAP solution \( \hat{\text{w}} \):

    $$
    \hat{\text{w}} = \left( \beta \phi(\text{X}) \phi(\text{X})^T + {S_{0}}^{-1} \text{I} \right)^{-1} \left( \beta \phi(\text{X}) \text{t} + {S_{0}}^{-1} m_{0} \right)
    $$

    To simplify the calculation, we often assume that the prior mean \( m_0 \) is \( 0 \):

    $$
    \hat{\text{w}} = \left( \beta \phi(\text{X}) \phi(\text{X})^T + {S_{0}}^{-1} \text{I} \right)^{-1} \beta \phi(\text{X}) \text{t}
    $$

  4. Calculating the posterior distribution:

    We have assumed Gaussian noise (the likelihood) and a Gaussian prior distribution over the weighting vector. Because the prior and the likelihood are conjugate, the posterior distribution over the weighting vector is also Gaussian; such a prior is called a conjugate prior. I will not explain conjugacy in this post, as it could be confusing here, but it will be introduced in a future post.

    Now let's go back to our training phase. The MAP solution \( \hat{\text{w}} \) is still a single value. But think about it: our posterior distribution is Gaussian, and for a Gaussian the mode coincides with the mean, so the MAP solution \( \hat{\text{w}} \) is exactly the mean \( m_N \) of the posterior distribution. Therefore:

    $$ m_N = \hat{\text{w}} $$

    And the covariance matrix:

    $$ S_N = \left( \beta \phi(\text{X}) \phi(\text{X})^T + {S_{0}}^{-1} \text{I} \right)^{-1} $$

    So the posterior distribution:

    $$ p(\text{w} | \text{X}, \text{t}, \beta, \alpha) = \mathcal{N}(\text{w} | m_N, S_N) $$
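
To make the steps above concrete, here is a minimal NumPy sketch of the whole training procedure. It assumes a polynomial basis, a zero-mean isotropic Gaussian prior \( \text{w} \sim \mathcal{N}(0, S_0 \text{I}) \) and toy sinusoidal data; the names `polynomial_basis` and `bayesian_fit` and the data are illustrative, not from PRML.

```python
import numpy as np

# A minimal sketch of the training phase (steps 1-4), assuming a polynomial
# basis and a zero-mean isotropic Gaussian prior w ~ N(0, S0 * I).
# The names polynomial_basis / bayesian_fit and the toy data are illustrative.

def polynomial_basis(x, degree):
    """Return phi(X) of shape (M, N): the n-th column is
    [1, x_n, x_n^2, ..., x_n^degree], with M = degree + 1."""
    return np.vander(x, degree + 1, increasing=True).T

def bayesian_fit(x, t, degree=3, S0=1.0):
    Phi = polynomial_basis(x, degree)                       # (M, N)
    M = Phi.shape[0]

    # 1. Least-squares solution: w_opt = (phi(X) phi(X)^T)^{-1} phi(X) t
    w_opt = np.linalg.solve(Phi @ Phi.T, Phi @ t)

    # 2. Noise precision beta from the mean squared residual
    beta = 1.0 / np.mean((t - Phi.T @ w_opt) ** 2)

    # 3. + 4. Posterior covariance S_N and mean m_N (= the MAP solution)
    S_N = np.linalg.inv(beta * (Phi @ Phi.T) + (1.0 / S0) * np.eye(M))
    m_N = S_N @ (beta * (Phi @ t))                          # m_0 = 0 assumed
    return m_N, S_N, beta

# Toy usage on noisy sinusoidal data (hypothetical example data)
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 10)
m_N, S_N, beta = bayesian_fit(x_train, t_train)
```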

Predictive Distribution

So far, the training phase is finished. But wait: the weighting vector is now a distribution, so how can we predict the target of a new input point with a distribution? According to Bishop's book Pattern Recognition and Machine Learning (PRML), the answer is to integrate (continuous case) or sum (discrete case) over the distribution of \( \text{w} \)!

$$
p(t | x, \text{X}, \text{t}) = \int { p(t | x, \text{w}) p(\text{w} | \text{X}, \text{t}, \beta, \alpha) \text{dw} }
$$

\(p(t | x, \text{w}) = \mathcal{N} (t | y(x, \text{w}), \beta^{-1})\) describes the Gaussian noise distribution, and \( p(\text{w} | \text{X}, \text{t}, \beta, \alpha) = \mathcal{N} (\text{w} | m_N, S_N)\) is the posterior distribution derived in the training phase. Since both are Gaussian, the predictive distribution is also Gaussian and can be evaluated analytically. According to PRML, the result is:

$$
p(t | x, \text{X}, \text{t}) = \mathcal{N} (t | m(x), s^2(x))
$$

where the mean and variance are

$$
m(x) = \beta \phi(x)^T \text{S} \phi(\text{X}) \text{t} \\
s^2(x) = \beta^{-1} + \phi(x)^T \text{S} \phi(x)
$$

and \( \text{S} \) is the posterior covariance matrix \( S_N \) from the training phase:

$$
\text{S} = S_N = (\beta \phi(X) \phi(X)^T + S_{0}^{-1} \text{I})^{-1}
$$
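
Continuing the training sketch above, here is how the predictive mean and variance could be computed. Note that since \( m_N = \beta S_N \phi(\text{X}) \text{t} \), the predictive mean simplifies to \( m(x) = \phi(x)^T m_N \). The function `predictive` is an illustrative name and assumes the variables `m_N`, `S_N` and `beta` produced by the training sketch.

```python
import numpy as np

# A minimal sketch of the predictive distribution p(t | x, X, t), assuming
# m_N, S_N and beta from the training sketch above (hypothetical helpers).

def predictive(x_new, m_N, S_N, beta, degree=3):
    """Return the predictive mean m(x) and variance s^2(x) at each query point."""
    Phi_new = np.vander(x_new, degree + 1, increasing=True).T   # (M, K), columns are phi(x)
    mean = Phi_new.T @ m_N                                      # m(x) = phi(x)^T m_N
    var = 1.0 / beta + np.einsum('mk,mn,nk->k', Phi_new, S_N, Phi_new)
    return mean, var                                            # s^2(x) = 1/beta + phi(x)^T S phi(x)

# Usage with the quantities from the training sketch:
# m_x, s2_x = predictive(np.linspace(0.0, 1.0, 100), m_N, S_N, beta)
```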

Note that both the mean and the variance depend on the input point \( x \); that is, the input location itself affects the predictive distribution. Let me show you an example. The data I have prepared is not quite good, so I highly recommend reading the example in PRML.

This example illustrates a simulated training procedure (our training is not actually implemented this way). The blue curve is our optimized result. The two thinner black curves show the predictive standard deviation \(s(x)\) at every point of the blue curve. They have been rescaled to make them visible, so it DOES NOT mean, for example, that the point at \(x = 0\) has a standard deviation of about \(1.25\).

It is obvious that in the first figure the uncertainty is quite low near the training points and relatively higher elsewhere. With more points inserted, the uncertainty becomes more evenly distributed. (Remember that I adjusted the scale to make the black curves visible, so the uncertainty around the first two points in the other two figures looks higher than in the first figure.)
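
A figure of this kind could be reproduced with a short matplotlib sketch like the following, again assuming the data and helpers from the previous sketches; it plots the predictive mean together with a band of one standard deviation \( s(x) \) (without the rescaling used in the figures above).

```python
import numpy as np
import matplotlib.pyplot as plt

# A hypothetical visualization, reusing x_train, t_train, predictive, m_N,
# S_N and beta from the sketches above.
x_plot = np.linspace(0.0, 1.0, 200)
m_x, s2_x = predictive(x_plot, m_N, S_N, beta)
s_x = np.sqrt(s2_x)

plt.scatter(x_train, t_train, color='black', label='training points')
plt.plot(x_plot, m_x, color='blue', label='predictive mean m(x)')
plt.fill_between(x_plot, m_x - s_x, m_x + s_x, alpha=0.2, label='m(x) +/- s(x)')
plt.legend()
plt.show()
```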

Summary

In this post we reviewed Bayesian curve fitting. The Bayesian approach is quite general and intuitive: all we need is the training set and a prior. Unfortunately, in general it can be impossible or impractical to derive the posterior distribution analytically. However, it is still possible to approximate the posterior with approximate Bayesian inference methods, which will be reviewed in the future.

Furthermore, we assumed Gaussian noise and a Gaussian prior distribution over the weighting vector here, but other combinations of distributions exist: under different noise assumptions, we may need to change the prior distribution accordingly. In the future I would also like to review some other noise distributions and their conjugate priors. Hope it helps :D

Reference

[1] data
[2] Conjugate Prior
[3] Bayesian Linear Regression
[4] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.