# Getting to the Mixtures in Dirichlet Process Mixture Models

The Dirichlet Process is pretty sweet as Edwin Chen and numerous others have pointed out. When combined with a mixture of Bayesian models, it’s really effective for finding an accurate number of models needed to represent your bundle of data, especially useful when you have no idea how many bundles of related words you may have. Edwin Chen does a really great job of explaining how the Dirichlet Process works, how to implement it, and how it can break down the McDonald’s menu, but he leaves out details on how to combine it with mixture models, which is by far the coolest part.

The short description for how to merge the Dirichlet Process with Nonparametric Mixture Modeling (and infer the model) looks like this:

1. Define probability distributions for your mixture model
2. Rewrite your distributions with Bayes rule so that parameters depend on data and hyper parameters
3. Repeat step 2 for hyper parameters as desired (it’s turtles all the way down, as Kevin Knight said)
4. For each data point
1. Forget about the current data point
2. Compute the likelihood that each mixture made this data point
3. Sampling a new label for this data point using these likelihoods
4. Jot down this new label
5. Remember this data point
5. After seeing all the data, re-estimate all your parameters based on the new assignments
6. Repeat steps 4 and 5 ad nausea.

In short, this is the Gibbs Sampling approach. If you already know what your distributions look like, and this paltry description made perfect sense, Hurray! Please go bake a cake for everyone that feels a little underwhelmed with this description.

Now that someone is off baking us cake, let’s dive down into what the distributions look like. Taking a look at a number of papers describing variations of the Dirichlet Process Mixture Model, like The Infinite Gaussian Mixture Model and Variational Inference for Dirichlet Process Mixture Models, they first make things seem pretty straightforward. To start, they generally throw down this simple equation:

Simple, no? If you want mixtures of Gaussian models, then $\theta_j$ has a mean and a variance, we’ll call these $\mu_j$ and $\sigma_j2$. The next step is then to rewrite this equation in a couple of ways so that you define the parameters, i.e. $\theta_j$ in terms of the data and other parameters. One example: $\mu_j$ looks like this:

Not too bad after reading through the definition for all the parameters; the first bundle describes the mean of the mixture and the second bundle describes the variance of the mixture. You may have noticed that our means for each mixture come from a Gaussian distribution themselves (I did say it was turtles all the way down). If you’re using a nice math coding environment, like Scalala and Scalanlp, you can easily sample from this distribution like this

This goes on for most of the parameters for the model until you get to one of the core tear inducing rewrites. This lovely gem describes the probably distribution function for one of the hyper parameters in the Infinite Gaussian Mixture Model:

This is not easy to sample from especially if you’re new to sophisticated sampling methods. And sadly, just about every academic publication describing these kinds of models gets to a point where they assume you know what they’re talking about and can throw down Monte Carlo Markov Chain sampling procedures like they’re rice at a wedding.

So what’s a lowly non-statistician data modeller to do? Start with a well written description of a similar, but significantly simpler model. In this case, Gibbs Sampling for the Uninitiated, which describes how to apply Gibbs to the Naive Bayes model. I’ll summarize this awesome paper in the next part of this series and then tweak it a little to make a Finite Gaussian Mixture Model.
After that’s said and done, I’ll show how to extend the finite version into the Dirichlet Process Mixture Model (with some simplifications).