[ML Review] Probability Distributions and Conjugate Prior

Introduction

In the post on Bayesian curve fitting, I mentioned that the Gaussian noise assumption (the likelihood) together with a Gaussian prior over the weight vector results in a Gaussian posterior. The likelihood and the prior are conjugate; such a prior is called a conjugate prior.

About Conjugate Prior

In general we work with the following probabilities: the prior \( p(\theta) \), the posterior \( p(\theta | X) \), the evidence \( p(X) \), and the likelihood \( p(X | \theta) \). By Bayes' theorem they are related as:

$$
posterior = \frac{likelihood \times prior} {p(X)}
$$

To keep the posterior calculation tractable, we often want the posterior and the prior to have the same functional "form" (just as both are Gaussian in Bayesian curve fitting). The posterior can then serve as the prior for the next batch of data, which yields a "chain" of updates:

$$
\begin{align}
posterior_1 & = \frac{likelihood \times prior} {p(X)} \\
posterior_2 & = \frac{likelihood' \times posterior_1} {p(X')} \\
& \cdots
\end{align}
$$

If, for a given likelihood, there exists a prior such that the posterior has the same functional form as the prior, then the prior is said to be conjugate to the likelihood. In other words, conjugacy is a relationship between the PRIOR and the LIKELIHOOD.

Previously we used the Gaussian noise assumption (likelihood) and a Gaussian prior. What about other likelihoods? In this post I review several common distributions and summarize their conjugate priors.

Probability Distributions

Three groups of distributions will be introduced in the following sections: the binary case (Bernoulli, Binomial and Beta distributions), the multinomial case (Multinomial and Dirichlet distributions), and the continuous case (Gaussian and Gamma distributions).

Bernoulli, Binomial and Beta Distribution

One of the simplest probability distributions is the Bernoulli distribution, which can be illustrated by flipping a coin. The random variable \( x \in \{ 0, 1 \} \), where \( x = 1 \) means heads and \( x = 0 \) means tails. The question is "what is the probability of flipping this coin and getting heads?" We denote this probability by \( \mu \):

$$
p(x = 1 | \mu) = \mu
$$

Therefore the Bernoulli distribution over \(x\) is written as:

$$
\begin{align}
\text{Bern}(x | \mu) & = \mu^{x}(1 - \mu)^{1 - x} \\
\mathbb{E}[x] & = \mu \\
\text{var}[x] & = \mu (1 - \mu)
\end{align}
$$
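
To make the mean and variance concrete, here is a minimal Python sketch that draws Bernoulli samples and checks \(\mathbb{E}[x]\) and \(\text{var}[x]\) empirically (the bias \(\mu = 0.7\) is just an assumed example value):

```python
# A minimal sketch checking the Bernoulli mean and variance empirically
# (the coin bias mu below is an assumed example value).
import random

random.seed(0)
mu = 0.7
samples = [1 if random.random() < mu else 0 for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(f"empirical E[x]   = {mean:.3f} (expected {mu})")
print(f"empirical var[x] = {var:.3f} (expected {mu * (1 - mu):.3f})")
```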

The Bernoulli distribution describes a single coin flip. The Binomial distribution goes further: it describes the outcome of \(N\) flips. The question becomes "what is the probability that \(m\) out of \(N\) flips come up heads?":

$$
\begin{align}
\text{Bin}(m | N, \mu) & = \left( \begin{array}{c} N \\ m \end{array} \right) \mu^{m} (1 - \mu)^{N - m} \\
\mathbb{E}[m] & \equiv \sum_{m = 0}^{N} {m \, \text{Bin}(m | N, \mu)} = N \mu \\
\text{var}[m] & \equiv \sum_{m = 0}^{N} {(m - \mathbb{E}[m])^2 \text{Bin} (m | N, \mu)} = N \mu (1 - \mu)
\end{align}
$$

It is obvious that the Bernoulli distribution is the special case of the Binomial distribution with \(N = 1\). As an example, we can compute the histogram of the Binomial distribution \(\text{Bin}(m | 10, 0.25)\), as sketched below.
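
A minimal sketch that evaluates \(\text{Bin}(m \mid 10, 0.25)\) with the standard library and prints a rough text histogram:

```python
# A minimal sketch of the Bin(m | 10, 0.25) histogram, standard library only.
from math import comb

N, mu = 10, 0.25

def binom_pmf(m: int, N: int, mu: float) -> float:
    """Bin(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

for m in range(N + 1):
    p = binom_pmf(m, N, mu)
    print(f"m = {m:2d}  p = {p:.4f}  " + "#" * round(p * 100))
```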

Now suppose we are given a set of observations \( \mathcal{D} = \{ x_1, …, x_M \} \), where each \( x_i, i \in \{1, …, M\} \), is the result of a flip (\(0\) for tails, \(1\) for heads). To predict new data and avoid overfitting, we can use Bayesian estimation rather than maximum likelihood. The likelihood is:

$$
p(\mathcal{D} | \mu) \propto \prod_{m = 1}^{M} {\mu^{x_m} (1 - \mu)^{1 - x_m}}
$$

For the reasons given above, we look for a prior distribution with the same functional form (a conjugate prior), something like:

$$
p(\mu | a, b) \propto \mu^{a} (1 - \mu)^{b}
$$

This can be related to the Beta Distribution:

$$
\begin{align}
\text{Beta}(\mu | a, b) & = \frac{\Gamma (a + b)} {\Gamma (a) \Gamma (b)} \mu^{a - 1} (1 - \mu)^{b - 1} \\
\Gamma (x) & \equiv \int_{0}^{\infty} {u^{x - 1} e^{-u}} \text{d}u \\
\mathbb{E}[\mu] & = \frac{a} {a + b} \\
\text{var}[\mu] & = \frac{ab} {(a + b)^{2} (a + b + 1)}
\end{align}
$$

\(\Gamma (x)\) is the gamma function; it satisfies \( \Gamma (x + 1) = x \Gamma (x) \), and for a positive integer \(n\) it reduces to the factorial, \( \Gamma (n + 1) = n! \). \(a\) and \(b\) are often called hyperparameters. The shape of the Beta distribution for a few settings of \(a\) and \(b\) can be computed as sketched below.
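
For a rough picture of how \(a\) and \(b\) shape the density, here is a small sketch that evaluates the Beta pdf at a few points (the \((a, b)\) pairs are assumed example values):

```python
# A minimal sketch of the Beta(mu | a, b) density, standard library only.
from math import gamma

def beta_pdf(mu: float, a: float, b: float) -> float:
    """Beta(mu | a, b) = Gamma(a + b) / (Gamma(a) Gamma(b)) * mu^(a-1) (1-mu)^(b-1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu**(a - 1) * (1 - mu)**(b - 1)

# Evaluate a few densities on a coarse grid of mu values.
for a, b in [(0.5, 0.5), (1, 1), (2, 3), (8, 4)]:
    values = [round(beta_pdf(mu, a, b), 3) for mu in (0.1, 0.25, 0.5, 0.75, 0.9)]
    print(f"a = {a}, b = {b}: {values}")
```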

Now let's apply it in the Bayesian approach, with \( \mathcal{D} = \{x_1, …, x_M\} \):

$$
\begin{align}
p(\mu | a_0, b_0, \mathcal{D}) & \propto p(\mathcal{D} | \mu) p(\mu | a_0, b_0) \\
& = \left( \prod_{n = 1}^{M} {\mu^{x_n} (1 - \mu)^{1 - x_n}} \right) \text{Beta} (\mu | a_0, b_0) \\
& \propto \mu^{m + a_0 - 1} (1 - \mu)^{(M - m) + b_0 - 1} \\
& \propto \text{Beta}(\mu | a_M, b_M)
\end{align}
$$

where \(m\) is the number of heads and \( a_M = a_0 + m, \ b_M = b_0 + (M - m) \). We can see that the hyperparameters \(a\) and \(b\) act as effective numbers of prior observations of heads (\(x = 1\)) and tails (\(x = 0\)) respectively. Since the posterior is again a Beta distribution with the same form as the prior, it can serve as the prior when new data arrive. As the size of the observed data set (\(M\)) increases:

$$
\begin{align}
a_M & \rightarrow m \\
b_M & \rightarrow M - m \\
\mathbb{E}[\mu] & \rightarrow \frac{m}{M} = \mu_{ML} \\
\text{var}[\mu] & \rightarrow 0
\end{align}
$$
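
The following sketch illustrates the "chain": each batch of simulated coin flips updates the Beta hyperparameters, and the posterior becomes the prior for the next batch (the prior values \(a_0 = b_0 = 2\) and the true bias are assumed for illustration):

```python
# A minimal sketch of the Beta-Bernoulli "chain": the posterior after one batch
# of coin flips becomes the prior for the next batch (hyperparameters assumed).
import random

random.seed(0)
a, b = 2.0, 2.0          # prior hyperparameters a_0, b_0 (assumed values)
true_mu = 0.25           # unknown head probability used to simulate data

for batch in range(5):
    flips = [1 if random.random() < true_mu else 0 for _ in range(20)]
    m = sum(flips)                       # number of heads in this batch
    a, b = a + m, b + len(flips) - m     # Beta update: a_M = a_0 + m, b_M = b_0 + (M - m)
    print(f"batch {batch}: posterior mean E[mu] = {a / (a + b):.3f}")
```

The printed posterior mean drifts toward the empirical fraction of heads as more data arrive, matching the limits above.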

Prediction asks the question: "what is the probability that the next coin toss lands heads up?" The answer is:

$$
\begin{align}
p(x = 1 | a_0, b_0, \mathcal{D}) & = \int_{0}^{1} {p(x = 1 | \mu) p(\mu | a_0, b_0, \mathcal{D})} \text{d} \mu \\
& = \int_{0}^{1} {\mu p(\mu | a_0, b_0, \mathcal{D})} \text{d} \mu \\
& = \mathbb{E}[\mu | a_0, b_0, \mathcal{D}] = \frac{a_M}{a_M + b_M}
\end{align}
$$

Therefore it equals the fraction of observations (real observations plus the prior's effective pseudo-counts) that correspond to \(x = 1\).

Multinomial and Dirichlet Distribution

Let's now consider a more complex situation: rolling a die instead of flipping a coin. Things get more complicated because there are now six possible outcomes, 1 through 6. For convenience we use the 1-of-\(K\) coding scheme, where \(K = 6\); e.g. rolling a 3 is encoded as \( \text{x} = (0, 0, 1, 0, 0, 0)^T \). The distribution of \(\text{x}\) with \(K\) outcomes is:

$$
p(\text{x} | \mu) = \prod_{k = 1}^{K} {\mu_k^{x_k}}, \ \text{where} \ \forall k : \mu_k \geq 0 \wedge \sum_{k = 1}^{K} {\mu_k = 1}
$$

This is the generalization of the Bernoulli distribution. \(\mu_k\) is the probability of outcome \(k\). Given a data set \(\mathcal{D} = \{\text{x}_1, …, \text{x}_M\}\), the likelihood is:

$$
p(\mathcal{D} | \mu) = \prod_{n = 1}^M {\prod_{k = 1}^K {\mu_k^{x_{nk}}}} = \prod_{k = 1}^K {\mu_k^{(\sum_{n = 1}^{M} {x_{nk}})}} = \prod_{k = 1}^K {\mu_k^{m_k}}
$$

\(m_k\) is the number of cases for which \(\text{x}_n\) has outcome \(k\). The Multinomial distribution is the joint distribution over \(m_1, …, m_K\) conditioned on \(\mu\) and \(M\):

$$
\begin{align}
\text{Mult} (m_1, …, m_K | \mu, M) & = \left( \begin{array}{c} M \\ m_1 … m_K \end{array} \right) \prod_{k = 1}^K {\mu_k^{m_k}} = \frac{M!}{m_1! … m_K!} \prod_{k = 1}^K {\mu_k^{m_k}} \\
\mathbb{E}[m_k] & = M \mu_k \\
\text{var}[m_k] & = M \mu_k (1 - \mu_k) \\
\text{cov}[m_j m_k] & = -M \mu_j \mu_k
\end{align}
$$

Therefore if we need to find a conjugate prior with the same form, the Dirichlet distribution is a good choice. It is the multivariate generalization of the Beta distribution:

$$
\begin{align}
\text{Dir} (\mu | \alpha) & = \frac{\Gamma(\alpha_0)} {\Gamma(\alpha_1) … \Gamma(\alpha_K)} \prod_{k = 1}^K {\mu_k^{\alpha_k - 1}}, \ \text{where} \ \alpha_0 = \sum_{k = 1}^K {\alpha_k} \\
\mathbb{E}[\mu_k] & = \frac{\alpha_k} {\alpha_0} \\
\text{var}[\mu_k] & = \frac{\alpha_k (\alpha_0 - \alpha_k)} {\alpha_0^2 (\alpha_0 + 1)} \\
\text{cov}[\mu_j \mu_k] & = - \frac{\alpha_j \alpha_k} {\alpha_0^2 (\alpha_0 + 1)}
\end{align}
$$

The posterior distribution over the parameters \(\mu\) is:

$$
p(\mu | \mathcal{D}, \alpha) \propto p(\mathcal{D} | \mu) p(\mu | \alpha) \propto \prod_{k = 1}^K {\mu_k^{\alpha_k + m_k - 1}}
$$

The posterior is again a Dirichlet distribution, so it can form the "chain" and serve as the prior for future data; \(\alpha_k\) can be interpreted as the effective number of prior observations of \(x_k = 1\).
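
A minimal sketch of the corresponding Dirichlet update, where the observed counts \(m_k\) are simply added to the prior pseudo-counts \(\alpha_k\) (the uniform prior \(\alpha_k = 1\) and the simulated rolls are assumed for illustration):

```python
# A minimal sketch of the Dirichlet-Multinomial update: counting die rolls and
# adding them to the prior pseudo-counts alpha_k (all concrete values assumed).
import random

random.seed(1)
K = 6
alpha = [1.0] * K                                   # Dirichlet prior hyperparameters alpha_k
rolls = [random.randrange(K) for _ in range(60)]    # simulated die rolls, outcomes 0..K-1

m = [rolls.count(k) for k in range(K)]              # m_k: number of times outcome k was observed
alpha_post = [a + mk for a, mk in zip(alpha, m)]    # posterior parameters: alpha_k + m_k

alpha_0 = sum(alpha_post)
print("posterior mean E[mu_k]:", [round(a / alpha_0, 3) for a in alpha_post])
```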

Gaussian, Gamma and Gaussian-Gamma Distribution

We have reviewed the discrete case, where each variable takes one of \(K\) possible values. Now let's consider continuous variables. The Gaussian distribution should already be familiar, so only its formula is given here.

$$
\begin{align}
\text{One-dimensional} & : & \mathcal{N} (x | \mu, \sigma^2) & = \frac{1} {\sqrt{(2 \pi)} \sigma} \exp{ \left( - \frac{(x - \mu)^2} {2 \sigma^2} \right) } \\
\text{Multi-dimensional} & : & \quad \mathcal{N} (\text{x} | \mu, \Sigma) & = \frac{1} {(2 \pi)^{D / 2} |\Sigma|^{1 / 2}} \exp{\left( - \frac{1} {2} (\text{x} - \mu)^T \Sigma^{-1} (\text{x} - \mu) \right)} \\
\end{align}
$$

Note that the Gaussian has two parameters: the mean \(\mu\) and the variance \(\sigma^2\) (or, in the multivariate case, the covariance matrix \(\Sigma\)). So if we want to estimate a Gaussian distribution from data, there are three possible situations:

  • Variance/covariance is known, but mean is unknown

  • Mean is known, but variance/covariance is unknown

  • Neither mean nor variance/covariance is known

Now, given a set of one-dimensional continuous variables \(\text{X} = (x_1, …, x_N)^T\), let’s discuss these three situations.

Variance is known, but mean is unknown

Assume the variance \(\sigma_{known}^2\) is known; then the likelihood function for the unknown \(\mu\) is:

$$
p(\text{X} | \mu) = \prod_{n = 1}^{N} {p(x_n | \mu)} = \frac{1} {(2 \pi \sigma_{known}^2)^{N / 2}} \exp \left( - \frac{1} {2 \sigma_{known}^2} \sum_{n = 1}^{N} {(x_n - \mu)^2} \right )
$$

For this case, the conjugate prior is also Gaussian:

$$
p(\mu | \mu_0, \sigma_0^2) = \mathcal{N} (\mu | \mu_0, \sigma_0^2)
$$

The posterior is:

$$
\begin{align}
p(\mu | \text{X}) \propto p(\text{X} | \mu) p(\mu | \mu_0, \sigma_0^2) & \propto \mathcal{N} (\mu | \mu_N, \sigma_N^2) \\
\text{where} \ \mu_N & = \frac{\sigma_{known}^2} {N \sigma_0^2 + \sigma_{known}^2} \mu_0 + \frac{N \sigma_0^2} {N \sigma_0^2 + \sigma_{known}^2} \mu_{ML} \\
\mu_{ML} & = \frac{1} {N} \sum_{n = 1}^{N} {x_n} \\
\frac{1} {\sigma_N^2} & = \frac{1} {\sigma_0^2} + \frac{N} {\sigma_{known}^2}
\end{align}
$$
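
A small sketch of this update, plugging simulated data into the formulas for \(\mu_N\) and \(\sigma_N^2\) above (the prior, the true mean and the known variance are assumed example values):

```python
# A minimal sketch of the posterior over mu when the variance is known
# (all concrete numbers below are assumed for illustration).
import random

random.seed(2)
sigma2_known = 4.0              # known noise variance
mu0, sigma02 = 0.0, 10.0        # Gaussian prior N(mu | mu0, sigma0^2)

data = [random.gauss(3.0, sigma2_known ** 0.5) for _ in range(50)]
N = len(data)
mu_ml = sum(data) / N           # maximum-likelihood mean

# Posterior N(mu | mu_N, sigma_N^2) from the formulas above.
mu_N = (sigma2_known * mu0 + N * sigma02 * mu_ml) / (N * sigma02 + sigma2_known)
sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2_known)
print(f"mu_N = {mu_N:.3f}, sigma_N^2 = {sigma_N2:.4f}")
```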

Mean is known, but variance is unknown

Now assume the mean \(\mu_{known}\) is known and the precision \(\lambda = 1 / \sigma^2\) needs to be inferred. The likelihood function is:

$$
p(\text{X} | \lambda) = \prod_{n = 1}^{N} {\mathcal{N} (x_n | \mu_{known}, \lambda^{-1})} \propto \lambda^{N / 2} \exp \left( - \frac{\lambda} {2} \sum_{n = 1}^{N} {(x_n - \mu_{known})^2} \right)
$$

Since \(\mu_{known}\) is known, the likelihood has the form \(\lambda^a \exp \left( - b \lambda \right)\), which matches the shape of a Gamma distribution over \(\lambda\):

$$
\begin{align}
\text{Gam} (\lambda | a, b) & = \frac{1} {\Gamma(a)} b^a \lambda^{a - 1} \exp \left( -b \lambda \right) \\
\mathbb{E} [\lambda] & = \frac{a} {b} \\
\text{var} [\lambda] & = \frac{a} {b^2}
\end{align}
$$

And the posterior is:

$$
p(\lambda | \text{X}) \propto \lambda^{a_0 - 1} \lambda^{N / 2} \exp \left( -b_0 \lambda - \frac{\lambda} {2} \sum_{n = 1}^N {(x_n - \mu_{known})^2} \right)
$$

To continue the chain, this posterior \(\text{Gam} (\lambda | a_N, b_N)\) can become the prior of the next iteration, with \( a_N = a_0 + \frac{N} {2} \) and \( b_N = b_0 + \frac{1} {2} \sum_{n = 1}^{N} {(x_n - \mu_{known})^2} \), as sketched below.
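
A minimal sketch of this Gamma update (prior hyperparameters and the simulated noise level are assumed):

```python
# A minimal sketch of the Gamma posterior over the precision lambda when the
# mean is known (hyperparameters and data are assumed for illustration).
import random

random.seed(3)
mu_known = 1.0
a0, b0 = 1.0, 1.0               # Gamma prior hyperparameters

data = [random.gauss(mu_known, 2.0) for _ in range(100)]   # true sigma = 2, so lambda = 0.25
N = len(data)

a_N = a0 + N / 2
b_N = b0 + 0.5 * sum((x - mu_known) ** 2 for x in data)
print(f"posterior mean E[lambda] = a_N / b_N = {a_N / b_N:.4f}")
```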

Neither mean nor variance is known

A more complex case is when both the mean and the variance are unknown. The likelihood function is:

$$
\begin{align}
p(\text{X} | \mu, \lambda) & = \prod_{n = 1}^N {\left( \frac{\lambda} {2 \pi} \right)^{1 / 2} \exp \left( - \frac{\lambda} {2} (x_n - \mu)^2 \right)} \\
& \propto \left( \lambda^{1 / 2} \exp \left( - \frac{\lambda \mu^2} {2} \right) \right)^N \exp \left( \lambda \mu \sum_{n = 1}^N {x_n} - \frac{\lambda} {2} \sum_{n = 1}^N {x_n^2} \right)
\end{align}
$$

The Gaussian-Gamma distribution can be utilized as conjugate prior here:

$$
\begin{align}
p(\mu, \lambda) & = \mathcal{N} (\mu | \mu_0, (\beta \lambda) ^{-1}) \text{Gam}(\lambda | a, b) \\
& \propto \lambda^{1 / 2} \exp \left( - \frac{\beta \lambda} {2} {(\mu - \mu_0)^2} \right) \lambda^{a - 1} \exp \left( -b \lambda \right)
\end{align}
$$
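
To see what this prior means generatively, here is a small sketch that draws \((\mu, \lambda)\) pairs from it: first \(\lambda\) from the Gamma, then \(\mu\) from a Gaussian whose precision is \(\beta \lambda\) (all hyperparameter values are assumed):

```python
# A minimal sketch of drawing (mu, lambda) samples from the Gaussian-Gamma prior
# (hyperparameter values are assumed for illustration).
import random

random.seed(4)
mu0, beta, a, b = 0.0, 2.0, 3.0, 1.0

samples = []
for _ in range(5):
    lam = random.gammavariate(a, 1.0 / b)           # lambda ~ Gam(a, b), rate parameterization
    mu = random.gauss(mu0, (beta * lam) ** -0.5)    # mu | lambda ~ N(mu0, (beta * lambda)^-1)
    samples.append((round(mu, 3), round(lam, 3)))
print(samples)
```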

Multivariate cases

So far we have only discussed the univariate case. What should we do if the data are higher-dimensional? The treatment is analogous, but the Wishart distribution now acts as the conjugate prior for the precision matrix. In summary:

  • \(\mu\) unknown, \(\Lambda\) known (\(\Lambda = \Sigma^{-1}\), \(\Sigma\) is the covariance matrix): using the Gaussian as conjugate prior, \(p(\mu)\)

  • \(\mu\) known, \(\Lambda\) unknown: using the Wishart as conjugate prior, \(p(\Lambda)\)

    $$
    \mathcal{W} (\Lambda | \text{W}, \nu) = B(\text{W}, \nu) |\Lambda|^{(\nu - D - 1) / 2} \exp \left( - \frac{1} {2} \text{Tr} (\text{W}^{-1} \Lambda) \right)
    $$

  • \(\mu\) and \(\Lambda\) unknown: using the Gaussian-Wishart as conjugate prior, \(p(\mu, \Lambda)\)

    $$
    p(\mu, \Lambda | \mu_0, \beta, \text{W}, \nu) = \mathcal{N} (\mu | \mu_0, (\beta \Lambda)^{-1}) \mathcal{W} (\Lambda | \text{W}, \nu)
    $$

Student's t-Distribution

If we take a univariate Gaussian and integrate out its precision \(\lambda\) under its conjugate Gamma prior, the resulting marginal distribution over \(x\) is the Student's t-distribution.

Summary

Likelihood functions and their conjugate priors are summarized in the table below.

| Likelihood Function | Formula | Conjugate Prior | Formula | Posterior |
| --- | --- | --- | --- | --- |
| Bernoulli | \( \mu^{x}(1 - \mu)^{1 - x} \) | Beta | \( \frac{\Gamma (a + b)} {\Gamma (a) \Gamma (b)} \mu^{a - 1} (1 - \mu)^{b - 1} \) | \( \text{Beta}(\mu \mid a_M, b_M) \) |
| Multinomial | \( \frac{M!}{m_1! … m_K!} \prod_{k = 1}^K {\mu_k^{m_k}} \) | Dirichlet | \( \frac{\Gamma(\alpha_0)} {\Gamma(\alpha_1) … \Gamma(\alpha_K)} \prod_{k = 1}^K {\mu_k^{\alpha_k - 1}} \) | \( \propto \prod_{k = 1}^K {\mu_k^{\alpha_k + m_k - 1}} \) |
| Gaussian, 1-dim, \(\mu\) unknown | \( \prod_{n = 1}^{N} {p(x_n \mid \mu)} \) | Gaussian | \( \mathcal{N} (\mu \mid \mu_0, \sigma_0^2) \) | \( \mathcal{N} (\mu \mid \mu_N, \sigma_N^2) \) |
| Gaussian, 1-dim, \(\lambda = 1 / \sigma^2\) unknown | \( \prod_{n = 1}^{N} {\mathcal{N} (x_n \mid \mu, \lambda^{-1})} \) | Gamma | \( \text{Gam} (\lambda \mid a_0, b_0) \) | \( \text{Gam} (\lambda \mid a_N, b_N) \) |
| Gaussian, 1-dim, both unknown | \( \prod_{n = 1}^{N} {\mathcal{N} (x_n \mid \mu, \lambda^{-1})} \) | Gaussian-Gamma | \( \mathcal{N} (\mu \mid \mu_0, (\beta \lambda)^{-1}) \, \text{Gam}(\lambda \mid a_0, b_0) \) | Gaussian-Gamma |
