Introduction to Variational Bayes

Assume we have data {Y = \{y_1, ..., y_{n_d}\}} and a parameter vector {Z = \{X, \theta\}}, where {X = \{x_1, ..., x_n\}} is formed by latent (non-observed) variables and {\theta = \{\theta_1, ..., \theta_m\}} are possible hyperparameters, usually connected to the likelihood and/or the distribution of the latent variables {X}. A Bayesian model specifies the joint distribution {p(Y, Z)}, and our main goal is to compute the posterior distribution of the model parameters {p(Z|Y)} and the marginal distribution of the data (or model evidence) {p(Y)}. In practice, those quantities are rarely available in closed form, which calls for some kind of numerical approximation. One of the many approximation methods used in this context is called Variational Bayes.

Variational Bayes

Let's introduce a distribution {q(Z)} defined over the parameters of the model {Z}. For any choice of {q(Z)}, the following equation holds (see for example Section {9.4} of [1]):

\displaystyle \ln p(Y) = \mathcal{L}(q) + \text{KL}(q||p), \ \ \ \ \ (1)

where

\displaystyle \mathcal{L}(q) = \int q(Z) \ln \bigg\{\frac{p(Y, Z)}{q(Z)}\bigg\}dZ, \ \ \ \ \ \ (2)

\displaystyle \text{KL}(q||p) = - \int q(Z) \ln\bigg\{\frac{p(Z|Y)}{q(Z)}\bigg\}dZ.

{\text{KL}(q||p)} is the Kullback-Leibler divergence [2] between {q(Z)} and the posterior distribution {p(Z|Y)}. Since {\text{KL}(q||p) \geq 0}, it follows from Eq. (1) that {\mathcal{L}(q) \leq \ln p(Y)}. That is, {\mathcal{L}(q)} is a lower bound on {\ln p(Y)}. Note also that {\text{KL}(q||p) = 0} if and only if {q(Z) = p(Z|Y)} almost everywhere.
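To make the decomposition in Eq. (1) concrete, the following sketch checks it numerically for a hypothetical discrete latent variable with three states. The joint probabilities and the choice of {q(Z)} are made up purely for illustration:

```python
import math

# Hypothetical joint p(Y, Z) over a discrete latent Z with 3 states,
# for one fixed observed data set Y (values invented for illustration)
p_joint = [0.10, 0.25, 0.15]          # p(Y, Z=k)
p_Y = sum(p_joint)                    # model evidence p(Y)
p_post = [p / p_Y for p in p_joint]   # posterior p(Z=k | Y)

# An arbitrary variational distribution q(Z)
q = [0.5, 0.3, 0.2]

# Lower bound: L(q) = sum_k q_k * ln( p(Y, Z=k) / q_k ), as in Eq. (2)
L = sum(qk * math.log(pk / qk) for qk, pk in zip(q, p_joint))

# KL(q || p) = -sum_k q_k * ln( p(Z=k | Y) / q_k ) >= 0
KL = -sum(qk * math.log(pk / qk) for qk, pk in zip(q, p_post))

# Eq. (1): ln p(Y) = L(q) + KL(q || p), for ANY choice of q
assert abs(math.log(p_Y) - (L + KL)) < 1e-12
```

Whatever {q} you plug in, the two terms always sum to {\ln p(Y)}; only the split between bound and divergence changes.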

Now, we can maximize the lower bound {\mathcal{L}(q)} by optimization with respect to the distribution {q(Z)} (hence the name variational, see note below), which is equivalent to minimizing {\text{KL}(q||p)}.

Note: The term variational comes from the calculus of variations, which is concerned with the behavior of functionals. Functions map the value of a variable to the value of the function. Functionals map a function to the value of the functional. {\mathcal{L}(q)}, for example, is a functional that takes the function {q(\cdot)} as input.

Is Variational Bayes an exact or an approximate method?

If we allow any possible choice of {q(Z)} when optimizing {\mathcal{L}(q)}, then the maximum of the lower bound occurs when the KL divergence {\text{KL}(q||p)} vanishes, which occurs when {q(Z)} equals the posterior distribution {p(Z|Y)}, and variational Bayes would then give an exact result.

However, maximizing {\mathcal{L}(q)} over all possible choices of {q(Z)} is not feasible. Therefore, we usually impose some restriction on the family of distributions {q(Z)} considered in the optimization. The goal is to restrict the family sufficiently so that computations are feasible, while at the same time allowing the family to be sufficiently rich and flexible that it can provide a good approximation to the true posterior distribution.

Parametric approximation

One way to restrict the family of approximating distributions is to use a parametric distribution {q(Z|\omega)}, like a Gaussian distribution for example, governed by a set of parameters {\omega}. The lower bound {\mathcal{L}(q)} then becomes a function of {\omega}, and we can exploit standard nonlinear optimization techniques to determine the optimal values for the parameters.
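As a small sketch of this idea (a toy model of my own choosing, not from [1]): take a single observation {y \sim N(z, 1)} with prior {z \sim N(0, 1)} and a Gaussian approximation {q(z|\omega) = N(m, s^2)} with {\omega = (m, s)}. The lower bound has a closed form for this model, and even a naive grid search over {\omega} recovers the exact posterior {N(y/2, 1/2)}:

```python
import math

y = 2.0  # a single observed data point (toy example)

def elbo(m, s):
    """Closed-form lower bound L(q) for y ~ N(z,1), z ~ N(0,1), q = N(m, s^2)."""
    e_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) ** 2 + s ** 2)
    e_logprior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s ** 2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s ** 2)
    return e_loglik + e_logprior + entropy

# Naive grid search over the variational parameters omega = (m, s)
grid = [i / 100 for i in range(1, 301)]
m_best, s_best = max(((m, s) for m in grid for s in grid),
                     key=lambda ms: elbo(*ms))

# The exact posterior is N(y/2, 1/2) = N(1, 1/2); the optimum matches it
assert abs(m_best - 1.0) < 0.011
assert abs(s_best - math.sqrt(0.5)) < 0.011
```

Because the Gaussian family contains the true posterior here, the maximized bound attains {\ln p(Y)} exactly; in realistic models it stays strictly below it. In practice one would replace the grid search with gradient-based optimization.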

Factorized approximation

A different restriction is obtained by partitioning the elements of {Z} into disjoint groups that we denote by {Z_i}, where {i = 1, ..., M}. We then assume that {q} factorizes with respect to these groups, so that

\displaystyle q(Z) = \prod _{i=1}^{M}q_i(Z_i). \ \ \ \ \ (3)

It should be emphasized that we are making no further assumptions about the distribution. In particular, we place no restriction on the functional forms of the individual factors {q_i(Z_i)}. We now substitute Eq. (3) into Eq. (2) and optimize {\mathcal{L}(q)} with respect to each of the factors in turn.

It can be shown (see Section {10.1.1} of [1]) that the optimal solution {q_j^*(Z_j)} for the factor {j} is given by

\displaystyle \ln q_j^*(Z_j) = E_{i \neq j}[\ln p(Y, Z)] + \text{const}, \quad j = 1, ..., M \ \ \ \ \ (4)

where the expectation {E_{i \neq j}} is taken with respect to all other factors {q_i} such that {i \neq j}.

The set of equations in (4) does not represent an explicit solution to our optimization problem, since the equation for a specific {j} depends on expectations computed with respect to the other factors {q_i}, {i \neq j}. We therefore solve them with an iterative method: first initialize all the factors appropriately, then cycle through them, updating each factor in turn using the current solutions for the other factors. Convergence is guaranteed because the bound is convex with respect to each of the factors {q_i(Z_i)}.
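A minimal sketch of this iterative scheme, loosely based on the factorized Gaussian example in [1] (the mean and precision values below are invented): approximate a correlated bivariate Gaussian posterior with precision matrix {\Lambda} by {q(z_1)q(z_2)}. Eq. (4) gives Gaussian factors whose means depend on each other, so we cycle until convergence:

```python
# Target posterior: bivariate Gaussian with mean mu and precision matrix lam
# (illustrative values, not from any real data set)
mu = (1.0, 2.0)
lam = ((2.0, 1.0),
       (1.0, 2.0))

# Eq. (4) yields Gaussian factors q_j = N(m_j, 1/lam_jj) with coupled means:
#   m_1 = mu_1 - (lam_12 / lam_11) * (m_2 - mu_2)   (and symmetrically for m_2)
m1, m2 = 0.0, 0.0  # arbitrary initialization
for _ in range(100):
    m1 = mu[0] - lam[0][1] / lam[0][0] * (m2 - mu[1])
    m2 = mu[1] - lam[1][0] / lam[1][1] * (m1 - mu[0])

# The factor means converge to the true posterior mean
assert abs(m1 - mu[0]) < 1e-9 and abs(m2 - mu[1]) < 1e-9

# The factor variances 1/lam_jj under-estimate the true marginal variances,
# which are the diagonal entries of the covariance matrix lam^{-1}
det = lam[0][0] * lam[1][1] - lam[0][1] * lam[1][0]  # = 3
true_var1 = lam[1][1] / det                          # = 2/3
assert 1.0 / lam[0][0] < true_var1                   # 0.5 < 2/3
```

Note that the last assertion already hints at a limitation of the factorized approximation discussed below: the factor variances are too small whenever the true posterior is correlated.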

As noted by [1], a factorized variational approximation tends to under-estimate the variance of the posterior distribution. Technically, this happens because we are minimizing {\text{KL}(q||p)}, which is zero-forcing: it forces {q(z) = 0} for every value of {z} where {p(z|Y) = 0}. As a result, {q(Z)} will typically under-estimate the support of {p(Z|Y)}, and in the case of a multi-modal posterior distribution it will tend to seek the mode with the largest mass.

In future posts, I will try to illustrate different implementations of Variational Bayes in practice.

References:

[1] Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
[2] Kullback, S., Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

Further reading:

– Ormerod, J.T., Wand, M.P. (2010). Explaining variational approximations. The American Statistician, 64(2).

Related posts:

INLA group
Kullback-Leibler divergence

Using Dropbox as a private git repository

I have been struggling for quite some time to find a cheap way to have many private git repositories. GitHub is nice for keeping all your public projects, but charges you a considerable amount per month if you want to have 50 private repositories. Then I thought, why not use Dropbox as a place to hold my private repositories? That sounded good, and after a quick search I found some websites explaining how to do it in less than a minute. I have followed the instructions written by Maurizio Turatti and Jimmy Theis, which I reproduce here in a more compact form. The following assumes you want to create a private repository located at ~/Dropbox/git/sample_project.git for your local project located at ~/projects/sample_project.

cd ~/Dropbox

# Dropbox/git will hold all your future repositories
mkdir git

# creates a sample_project.git folder to act as
# a private repository
cd git
mkdir sample_project.git
cd sample_project.git

# Initialize your private repository. Using --bare
# here is important.
git init --bare

# Initialize git in your local folder
cd ~/projects/sample_project
git init

# Show the path of your private repository
git remote add origin ~/Dropbox/git/sample_project.git

# Now you can use git as usual
git push origin master

After that you can work from your local folder and use push, pull and clone as if you were using GitHub. You can now share your Dropbox folder with your collaborators so that they can do the same. Just be careful if you use this with many collaborators, since conflicts might happen if two people push content at the same time. So far I have used this solution only to host private projects that I work on by myself.

To test that everything is working, try

git clone ~/Dropbox/git/sample_project.git ~/test

and check if the content of your project is in ~/test.

6th week of Coursera’s Machine Learning (advice on applying machine learning)

The first part of the 6th week of Andrew Ng’s Machine Learning course at Coursera provides advice for applying Machine Learning. In my opinion, this week was the most useful and important one of the course, mainly because the kind of knowledge it provides is not easily found in textbooks.

The lectures discussed how to diagnose what to do next when you fit a model to the data you have in hand and discover that the model is still making unacceptable errors on an independent test set. Among the things you can do to improve your results are:

  • Get more training examples
  • Try smaller sets of features
  • Get additional features
  • Fit more complex models

However, it is not easy to decide which of the points above will be useful in your particular case without further analysis. The main lesson is that, contrary to what many people think, getting more data or fitting more complex models will not always help you get better results.

Training, cross-validation and test sets

Assume we have a model with parameter vector {\theta}. As I have mentioned here, the training error, which I denote from now on as {J_{train}(\theta)}, is usually smaller than the test error, denoted hereafter as {J_{test}(\theta)}, partly because we are using the same data to fit and to evaluate the model.

In a data-rich environment, the suggestion is to divide your data-set in three mutually exclusive parts, namely training, cross-validation and test set. The training set is used to fit the models; the (cross-) validation set is used to estimate prediction error for model selection, denoted hereafter as {J_{cv}(\theta)}; and the test set is used for assessment of the generalization error of the final chosen model.

There is no unique rule on how much data to use in each part, but reasonable choices vary between {50\%}, {25\%}, {25\%} and {60\%}, {20\%}, {20\%} for the training, cross-validation and test sets, respectively. One important point not mentioned in the Machine Learning course is that you are not always in a data-rich environment, in which case you cannot afford to use only {50\%} or {60\%} of your data to fit the model. In that case you might need to use cross-validation techniques or measures like AIC, BIC and the like to obtain estimates of the prediction error while retaining most of your data for training purposes.
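A three-way split along these lines can be sketched as follows (the function name and the 60/20/20 defaults are my own choices for illustration):

```python
import random

def train_cv_test_split(data, f_train=0.6, f_cv=0.2, seed=42):
    """Shuffle the data and split it into mutually exclusive
    training, cross-validation and test sets."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    idx = list(range(len(data)))
    rng.shuffle(idx)           # shuffle before splitting
    n_train = int(f_train * len(data))
    n_cv = int(f_cv * len(data))
    train = [data[i] for i in idx[:n_train]]
    cv = [data[i] for i in idx[n_train:n_train + n_cv]]
    test = [data[i] for i in idx[n_train + n_cv:]]
    return train, cv, test

data = list(range(100))  # placeholder for 100 observations
train, cv, test = train_cv_test_split(data)
# 60 / 20 / 20 observations, together covering the whole data set
```

Shuffling before splitting matters: if the data are ordered (say, by collection date), a naive contiguous split would give training and test sets with different distributions.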

Diagnosing bias and variance

Suppose your model is performing worse than you consider acceptable. That is, assume {J_{cv}(\theta)} or {J_{test}(\theta)} is high. The important first step in figuring out what to do next is to find out whether the problem is caused by high bias or high variance.

Under-fitting vs. over-fitting

In a high bias scenario, which happens when you under-fit your data, the training error {J_{train}(\theta)} will be high, and the cross-validation error will be close to the training error, {J_{cv}(\theta) \approx J_{train}(\theta)}. The intuition here is that since your model doesn’t fit the training data well (under-fitting), it will not perform well for an independent test set either.

In a high variance scenario, which happens when you over-fit your data, the training error will be low and the cross validation error will be much higher than the training error, {J_{cv}(\theta) >> J_{train}(\theta)}. The intuition behind this is that since you are overfitting your training data, the training error will be obviously small. But then your model will generalize poorly for new observations, leading to a much higher cross-validation error.
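These two rules of thumb can be summarized in a short helper. The thresholds below (`target_error`, `gap_factor`) are arbitrary choices of mine, not from the course, and would need tuning for a real problem:

```python
def diagnose(j_train, j_cv, target_error, gap_factor=2.0):
    """Rule-of-thumb bias/variance diagnosis from training and CV errors.
    target_error and gap_factor are illustrative thresholds."""
    # High bias: J_train is itself high and J_cv is close to it
    if j_train > target_error and j_cv < gap_factor * j_train:
        return "high bias (under-fitting)"
    # High variance: J_train is low but J_cv is much higher
    if j_train <= target_error and j_cv >= gap_factor * j_train:
        return "high variance (over-fitting)"
    return "inconclusive: look at the plots"

# An under-fitted model: high training error, J_cv close to J_train
print(diagnose(0.30, 0.35, target_error=0.10))  # high bias (under-fitting)
# An over-fitted model: low training error, J_cv much higher
print(diagnose(0.02, 0.25, target_error=0.10))  # high variance (over-fitting)
```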

Helpful plots

Some plots can be very helpful in diagnosing bias/variance problems. For example, a plot that maps the degree of complexity of different models on the x-axis to the respective values of {J_{train}(\theta)} and {J_{cv}(\theta)} for each of these models on the y-axis can identify which models suffer from high bias (usually the simpler ones) and which suffer from high variance (usually the more complex ones). Using this plot, you can pick the model that minimizes the cross-validation error. Besides, the gap between {J_{train}(\theta)} and {J_{cv}(\theta)} for the chosen model can help you decide what to do next (see next section).

Another informative plot is obtained by fitting the chosen model for different training set sizes and plotting the respective {J_{train}(\theta)} and {J_{cv}(\theta)} values on the y-axis. In a high bias scenario, {J_{train}(\theta)} and {J_{cv}(\theta)} gradually converge to the same value. In a high variance scenario, however, a gap between {J_{train}(\theta)} and {J_{cv}(\theta)} remains even when all your training data has been used. Also, if the gap between {J_{train}(\theta)} and {J_{cv}(\theta)} is still decreasing as the number of data points grows, it is an indication that more data would give you even better results, while a gap that stops decreasing at some specific data set size indicates that collecting more data is probably not where you should concentrate your efforts.


What to do next

Once you have diagnosed whether your problem is high bias or high variance, it is time to decide what you can do to improve your results.

For example, the following can help improve a high bias learning algorithm:

  • getting additional features (covariates)
  • adding polynomial features ({x_1^2}, {x_2^2}, …)
  • fitting more complex models (like neural networks with more hidden units/layers or smaller regularization parameter)
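As a sketch of the second item, here is a hypothetical helper that expands each feature with its powers (a deliberately naive version; real implementations would typically also add interaction terms and rescale the features):

```python
def add_polynomial_features(rows, degree=2):
    """Append powers x_j^2, ..., x_j^degree of every original feature."""
    out = []
    for row in rows:
        expanded = list(row)  # keep the original features
        for d in range(2, degree + 1):
            expanded.extend(x ** d for x in row)  # add x_j^d for each feature
        out.append(expanded)
    return out

X = [[2, 3], [1, 4]]             # two observations, two features each
X2 = add_polynomial_features(X)  # each row gains x_1^2 and x_2^2
```

Adding such features lets a linear model fit curved relationships, reducing bias, at the cost of increasing the risk of high variance, which is why the diagnosis above should come first.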

For the case of high variance, we could:

  • get more data
  • try smaller sets of features
  • use simpler models

Conclusion

The main point here is that getting more data or using more complex models will not always help you improve your data analysis. A lot of time can be wasted trying to obtain more data when the true source of your problem is high bias, which will not be solved by collecting more data unless other measures are taken to address it. The same is true when you find yourself trying ever more complex models while the current model is actually suffering from a high variance problem. In this case a simpler model, rather than a more complex one, might be what you are looking for.

References:

Andrew Ng’s Machine Learning course at Coursera

Related posts:

Bias-variance trade-off in model selection
Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [1/3]
Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [2/3]
Model selection and model assessment according to (Hastie and Tibshirani, 2009) – Part [3/3]
4th and 5th week of Coursera’s Machine Learning (neural networks)