# Papers I’ve Really Enjoyed Reading

As any good grad student would, I’ve read a lot of research papers, and my reading interests have swayed far and wide across the Natural Language Processing spectrum, even expanding into more general Machine Learning methods and Bayesian modelling. Now that I’m transitioning from a full-time grad student focused on research to being a full-time software developer at Google, I want to briefly jot down a list of the papers that have stuck out the most for me as I make the transition. So in no particular order, and in no way totally inclusive, here they are:

• Hierarchical Dirichlet Processes: This paper really helped me figure out how Dirichlet Processes work, the multiple interpretations of these models, and their inference procedures. On top of that, it introduced an even cooler hierarchical extension to Dirichlet Processes. A must-read for anyone interested in non-parametric Bayesian models.
• Particle Learning for General Mixtures: I still don’t fully understand how to design good particle filters, but this paper pushed my mind in a lot of ways and got me excited about Sequential Monte Carlo methods quite a while ago. It’s even got a cool surprise: it’s written by people in business schools!
• The Curious Case of Behavioral Backlash: Why Brands Produce Priming Effects and Slogans Produce Reverse Priming Effects: In general I love anything to do with psychological priming, so this paper caught my attention when I came across it, and the horribly long title says what it’s all about. They go through a battery of experiments testing the impact that both branding and slogans have on people. Their results are kinda surprising.
• Cross-Cutting Models of Lexical Semantics: This paper is both really simple and really complicated at the same time. They glue together two really great ideas: Latent Dirichlet Allocation and Dirichlet Process Mixture Models to build a nicely thought out Cross Cutting, or multiple clustering under multiple views, model for processing text documents. I only wish I’d thought of this idea first.
• An Introduction to MCMC for Machine Learning: If you love topic models, Dirichlet Process Mixture Models, or any other sophisticated Bayesian model, then this paper is a must-read. Written by some of the big hitters in MCMC research, it covers all the major Markov Chain Monte Carlo approaches in sweet, sweet detail.
• The Infinite Gaussian Mixture Model: Yet another paper based on Dirichlet Process Mixture Models. It’s a totally non-parametric Bayesian clustering model with deep layers of hierarchy to smooth out the mixtures nicely.
• A Nonparametric Bayesian Model for Multiple Clusterings with Overlapping Feature Views: This was my first introduction to cross-cutting clustering models. They take a totally different approach than the original multi-clustering work, and as complicated as it looks at first, it makes a nice amount of sense after a while.
• Semantic Taxonomy Induction from Heterogeneous Evidence: Yet another paper I wish I’d thought of first. How do you extend an ontology automatically? You just use the knowledge you already have encoded to train a classifier for the kind of relationships you want to add and let it fly over new data. Final product? A bigger and better taxonomy with oodles of new information.
• Personalizing PageRank for Word Sense Disambiguation: This is one of the earlier really great graph-based and fully unsupervised Word Sense Disambiguation papers. It glues together the magic of PageRank and intuitions about word senses to get a general model that’s pretty hard to beat even today.
• An Experimental Study of Graph Connectivity for Unsupervised Word Sense Disambiguation: Another really good paper covering all the ways you can build fully unsupervised Word Sense Disambiguation models using graphs of word senses. The best of the models used still get used pretty often in WSD.
• Improving Word Representations via Global Context and Multiple Word Prototypes: Richard Socher is a badass. Simple as that. I never thought I’d see a totally legitimate and awesome lexical semantics model use a neural network, but they made it work beautifully. It’s a good competitor to the cross-cutting model of semantics too.
• A probabilistic model of cross-categorization: Probably the best starting point for learning about cross categorization (cross-cutting) models. It gives a good bit of theory of how they work, how to build them, and a huge swath of examples of how they can be applied to approximate human decision making. Really impressive stuff.

# Finding Events During the Olympics

I needed the event times for every game, tournament, and competition that happened during the London 2012 Olympics. Thankfully the London 2012 website posts all this information online. Sadly, this information isn’t posted anywhere in an easy-to-use format. I was deeply hoping that there would be a csv file describing when every event took place, but to no avail. Instead, all I had access to was beautifully rendered web pages like the one below, which graphically detail when each event began and ended for each day.

This handy web page has all the information I need, but how easy is it to get that information out?

## The raw event information

Well, let’s take a gander at the html used to describe each event:

It’s pretty obvious when the event took place, no?

Not really. They’ve crammed all the timing information into the second div element using two different attributes: class and data-type. Furthermore, they’ve also encoded the start time within a deeply nested span. But it gets better. They use two different time zones and two different date formats: the class attribute uses GMT+01:00 with a pretty readable format, while the data-type uses GMT+00:00 with a horribly mashed together format. But at least it’s a regular pattern. So we should be able to get everything we want by treating this as xml and just extracting the attribute values for the nodes we care about.
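That plan is easy to sketch. Here’s roughly what attribute extraction looks like in Python; the snippet and attribute values below are made up stand-ins, since the real london2012 markup isn’t reproduced here:

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for the london2012 markup; the real
# attribute formats differ, but the extraction idea is the same.
snippet = '<div class="10:30" data-type="20120809T093000"><span>stuff</span></div>'

class TimeAttrParser(HTMLParser):
    """Collect the class and data-type attributes of every div."""
    def __init__(self):
        super().__init__()
        self.times = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attrs = dict(attrs)
            self.times.append((attrs.get("class"), attrs.get("data-type")))

parser = TimeAttrParser()
parser.feed(snippet)
print(parser.times)  # [('10:30', '20120809T093000')]
```

On clean input this works fine; the catch, as we’ll see next, is what the real pages do to a strict parser.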

## The obvious way to extract those times

XML parsing is both pretty fast and easy, if you select the right language. Let’s throw this into a simple xml parser and see what happens:

Parsing…parsing…parsing…

So, it looks like XML parsing this beast is both slow and highly error prone. What else could we do? We could try using a more graceful xml parser that’s built to deal with html’s malformities, but I’ll leave out the story of how that doesn’t work well in this case either. Instead, we’ll do something even more silly.

## In rolls javascript

Obviously some code path out in the world has to be capable of handling this html mess. Otherwise it’d never show up in a browser. But what’s the shortest path for doing this? Hacking through either the Firefox source code or the Chromium source code both sound like longest path solutions to me. In comes the magic of JavaScript. jQuery is able to single-handedly take an html file, turn it into a Document Object Model structure, and let you run queries over it. Not only does this work in a few lines of code, it’s super fast.

### A quick setup

But first you have to do a little setup to run all this outside of a browser. The easiest system I found was node.js. To set up, all you have to do is grab a tarball (I used this one), untar it, and add the binaries to your path:

wget http://nodejs.org/dist/v0.8.7/node-v0.8.7-linux-x86.tar.gz
tar -xvzf node-v0.8.7-linux-x86.tar.gz
export PATH=$PATH:$(pwd)/node-v0.8.7-linux-x86/bin


And to get jQuery setup for node.js, it’s as easy as running:

npm install jquery


The only hitch is that you have to run this from the root directory where you’ll be running your processing, but that’s a minor problem.

### Parsing Power!

Now that our javascript runtime environment is setup with the library we need, it’s time to get into the meat of the code. The approach is pretty simple:

1. Parse an html file using jQuery.
2. Find the nodes with the bar id nested within a div having a disciplines id.
3. Get the class and data-type attributes from the div we got back and handle one of three cases:
   1. The class attribute tells us all we need.
   2. Both class and data-type include timing information.
   3. The data-type attribute tells us all we need.

So let’s code that up! First, import the js libraries you need.

Now parse the html to get a DOM structure:

So far, so easy breezy. Now comes the challenging part: grabbing the divs holding the timing info we need and checking for the funny ways london 2012 decided to encode that information:

And that’s it. With jQuery and node.js, I have significantly more comments than actual code, mostly to describe my new understanding of the APIs and the data-based logic that no human should have to decipher from code.

## Actually getting the data

Actually downloading the data requires a few simple tricks that are worth noting, just in case anyone ever wants to duplicate this whole shebang.

First, london 2012 doesn’t let you do a normal wget call; it requires a user agent. So every request looks like this:

userAgent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3"
wget --user-agent="$userAgent" $url


Now we just need to figure out the pattern for each sport. For once, london 2012 makes that part easy. Their urls use just two key values: the sport you want and the day you want. So here’s a little bit of bash to extract all that (and save everything under an easier to use file name):

sports="judo gymnastics-artistic fencing tennis archery"
julyDays=$(seq 25 31 | while read day; do echo $day-july; done)
augDays=$(seq 1 12 | while read day; do echo $day-august; done)
userAgent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3"
resultDir=tmpHtmlData

# Define a helper function to return the full url based on the sport and day.
# This is just to make changing the url easier.
function olympicsUrl() {
echo http://www.london2012.com/$sport/schedule-and-results/day=$day/all-day.html
}

# Now iterate through every sport.
# Now iterate through every sport.
for sport in $sports; do
    # And every day of the olympics.
    for day in $julyDays $augDays; do
        # Do a simple wget request. Since each page has the same html file
        # name, the -O bit saves the html output to a file based on the
        # sport and day for easier management.
        wget --user-agent="$userAgent" -O $resultDir/$sport-$day.html $(olympicsUrl)
    done
done

And voila, you have all the sports you want for each day of the olympics. You can now easily power through all those files with this dandy bit of code:

for sport in $sports; do
    for html in $resultDir/*$sport*.html; do
        node src/main/javascript/ExtractTimes.js $html
    done > olympics.$sport.times.txt
done


# Building a Phrase Graph

Research papers. I hate them sometimes. They present a great idea, talk about how it can be used and applied, and then give only the barest description of how to actually build and implement the idea, often with no pointers or links to what they built. My current frustration is with building a Phrase Graph. The idea behind phrase graphs is pretty simple: they encode a large set of sentences with a minimal automaton.

Here’s a simple example. Say you have the following three sentences:

#archery by the Republic of Korea and the guy is legally blind.
#archery by the Republic of Korea and the guy who is legally blind.
#archery by the Republic of Korea in archery by a guy who is legally blind.


There’s quite a lot of overlap at the start of the sentence and at the end of the sentence. So a good phrase graph would look something like this:

So you can see that the nodes in the graph represent words found in the observed sentences, and if you weight the nodes based on how often they are traversed, you can start to detect which phrases are used most frequently. But how does one do this? And how does one do this efficiently? This is where research papers make me mad. They fail to point out the simplest algorithms for building these awesome ideas. To complement these research ideas, this post’ll give a little more detail on what these phrase graphs are, an easy way to build them using existing libraries, and code for writing your own custom phrase graph!

## Badly described methods for building a phrase graph

The first paper I read for building phrase graphs, titled Summarizing Sporting events using Twitter, gives this highly detailed algorithm description:

The phrase graph consists of a node for each word appearing in any status update, and an edge between each set of two words that are used adjacently in any status update

Seems easy to implement, no? Here’s a more detailed algorithm, found in Experiments in Microblog Summarization:

To construct the left-hand side, the algorithm starts with the root node. It reduces the set of input sentences to the set of sentences that contain the current node’s phrase. The current node and the root node are initially the same. Since every input sentence is guaranteed to contain the root phrase, our list of sentences does not change initially. Subsequently, the algorithm isolates the set of words that occur immediately before the current node’s phrase. From this set, duplicate words are combined and assigned a count that represents how many instances of those words are detected. For each of these unique words, the algorithm adds them to the graph as nodes with their associated counts to the left of the current node.

This gives a lot more detail on what the phrase graph contains, and an easy enough algorithm, but it’s not exactly a fast one, especially if you want to run it over 10 million tweets about the Olympics. Both descriptions also leave out a key detail: these phrase graphs are really just Compressed Tries.

## Tries and their compressed cousins

Tries are one of the simplest data structures, and one of the most powerful for processing natural languages. Given a set of words or sentences, a Trie is essentially a standard tree where paths from the root spell out the observed words or sentences. The power of it is that each internal node in the Trie represents a shared prefix. So if you want to build a Trie for an English dictionary, the root node would be a blank character, which then points to a node for each letter of the alphabet. From the “a” child, you would then have access to all words starting with “a”, and the further down the Trie you go, the longer the prefixes you cover.
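As a concrete sketch of that structure (not the post’s code, just the textbook idea in Python, using nested dicts):

```python
# A minimal trie sketch: nested dicts, with a sentinel key marking word ends.
END = "$"  # sentinel; assumes "$" never appears in the input words

def insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def words_with_prefix(trie, prefix):
    """Walk down to the prefix's node, then collect every word below it."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found = []
    def collect(n, sofar):
        for key, child in n.items():
            if key == END:
                found.append(prefix + sofar)
            else:
                collect(child, sofar + key)
    collect(node, "")
    return found

trie = {}
for w in ["ant", "and", "art", "bat"]:
    insert(trie, w)
print(sorted(words_with_prefix(trie, "an")))  # ['and', 'ant']
```

The shared "an" prefix is stored exactly once, which is the whole trick.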

Now a Phrase Graph is essentially a Trie which condenses not only shared prefixes, but also any shared subsequences, be they in the middle or the end. Formally, they are directed acyclic graphs, and if they are treated as a lookup structure, a la a dictionary, they are called Minimal Acyclic Finite-State Automata. And there are plenty of fast and simple ways to build these things. The easiest place to start reading about them is Incremental Construction of Minimal Acyclic Finite State Automata. The Brics Automaton package also provides a really good implementation of these that works out of the box.

The above code snippet will generate this simple minimal automaton:

*Note, you may want to open this in a new tab to zoom in, as every letter has its own state.

## Rolling your own automata builder

Using Brics works really well if you just want to check whether or not a sentence matches one seen in a corpus. However, it doesn’t easily let you check how often particular sub-phrases are used within the corpus. For that kinda power, you’ll have to craft your own implementation. And now it’s time to share the very code to do this!

### Nodes in the graph, that’s where.

Where does one start? First, you need a node data structure, with some very carefully crafted functions to determine equality (which, incidentally, the research papers don’t point out).

This PhraseNode has three fairly simple data members: a label that records which word the node represents, the weight of the node, and a map from this node to its children based on their labels. The tricky part of this node is how you determine equality. Two nodes can be equal in two different senses: 1) they are the exact same data structure in memory, and so their memory locations are the same, or 2) they have the same label and point to the same exact children in memory. Checking the first type of equality is easy: you can compare identity hash codes using the default hashCode method Java provides for every object. Checking the second form of equality is more challenging to do efficiently. The naive way would be to recursively check that all children eventually point to sub-graphs with the same structure. However, checking the identity hash codes of each child is much faster and accomplishes the same goal. Hence, this is why we override hashCode and equals with such complicated code.
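The actual implementation is JVM code, but the two notions of equality are easy to sketch in Python (the names here are mine, not the original’s):

```python
class PhraseNode:
    """Sketch of the node described above: a label, a weight, and a
    label -> child map. Structural equality means same label and the
    *same child objects in memory* (compared by identity, not by value)."""

    def __init__(self, label):
        self.label = label
        self.weight = 0
        self.children = {}  # label -> PhraseNode

    def same_structure(self, other):
        # Equal labels, plus every child pointer identical (`is`), which is
        # the fast stand-in for recursively comparing whole sub-graphs.
        return (self.label == other.label
                and self.children.keys() == other.children.keys()
                and all(self.children[k] is other.children[k]
                        for k in self.children))

shared = PhraseNode("blind.")
a, b = PhraseNode("legally"), PhraseNode("legally")
a.children["blind."] = shared
b.children["blind."] = shared
print(a.same_structure(b))  # True: same label, same child object
b.children["blind."] = PhraseNode("blind.")
print(a.same_structure(b))  # False: equal-looking but distinct child
```

That second case is exactly why the identity comparison matters: two sub-graphs can look alike without actually being merged yet.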

### Linking together those Phrase Nodes

The algorithm for linking together PhraseNodes such that they form a minimal automaton relies on a few interesting tricks and beautiful recursion. The first trick we need is lexicographically sorted input. By sorting the input, you’re maximizing the size of the matching prefix between any neighboring sentences you want to put in the graph. So let’s look at how we add entries in sorted order.

Before we get there though, let’s sketch out a CondensedTrie data structure. It’s pretty simple. It starts off with just a single PhraseNode element, the root.

Now comes adding entries. Since we want a clean and easy to use interface, we’ll be defensive and assume the elements aren’t sorted, but that they are already tokenized, so each element in the given list is a sequence of tokens. How you sort those beasts is a homework assignment.

These methods are mostly simple. addSuffix starts adding links to nodes from some initial point; note that nodes will automatically create a new node for a word if one doesn’t already exist. computeDeepestCommonNodeAndSuffix walks down the Trie starting at the root, consuming each token that has a node, and returns the deepest node reachable, i.e. it finds the node with the longest common prefix with a given sequence of tokens. Finally, adding a single tweet depends on getting the prefix, doing some magic called replaceOrRegister, and then adding the suffix to the last node in the longest prefix. So, the only question left: what is this registry business?

The registry keeps track of all nodes in the graph after they’ve been validated. And what does validation entail? It involves checking whether an equivalent node already exists in the registry. If one does, you simply replace the duplicate node with the one in the registry. If no such node exists, in goes the node. And this is exactly what replaceOrRegister does. To do this efficiently and correctly, we call replaceOrRegister on the last node in our common prefix and walk all the way down along the most recently added path, i.e. the path added by the last element we added, and then zip up any matching nodes which correspond to matching suffixes. By starting at the bottom, we match together end points which have no children and merge them.

Take our archery example above: all three sentences end with “is legally blind.” After we add the first sentence, there would be a node for each token in the order of the sentence. When we add the second sentence and walk down to the end, we see that “blind.” has a duplicate, which we can merge. Taking one step backwards, we see that “legally” also has an exact match, where two nodes with the same label point to the same exact node, namely the node we just merged. And then thanks to recursion, we keep zipping things along until we get to “who”, which has no exact match, and we can stop zipping. Walking through an example like this should make the code below a little clearer.
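To make the whole flow concrete, here’s a hedged end-to-end sketch in Python rather than the post’s own Scala. The names (`add_all`, `replace_or_register`, and so on) mirror the prose above but are my own, and it assumes sorted, tokenized input where no sentence is a strict prefix of another:

```python
class PhraseNode:
    """A label, a traversal count, and a label -> child map."""
    def __init__(self, label):
        self.label = label
        self.weight = 0
        self.children = {}

    def signature(self):
        # Structural identity: the label plus the *exact* child objects.
        return (self.label,
                tuple(sorted((l, id(c)) for l, c in self.children.items())))

class CondensedTrie:
    def __init__(self):
        self.root = PhraseNode("")
        self.register = {}  # signature -> canonical node

    def add_all(self, sentences):
        # Sorted input maximizes shared prefixes between neighbors.
        for tokens in sorted(sentences):
            self.add(tokens)
        self.finish()

    def add(self, tokens):
        node, i = self.root, 0
        # Walk the deepest common prefix already in the graph.
        while i < len(tokens) and tokens[i] in node.children:
            node = node.children[tokens[i]]
            node.weight += 1
            i += 1
        # Zip up the previously added branch before we diverge from it.
        if node.children:
            self.replace_or_register(node)
        # Add the remaining suffix as a fresh chain.
        for tok in tokens[i:]:
            child = PhraseNode(tok)
            child.weight = 1
            node.children[tok] = child
            node = child

    def replace_or_register(self, node):
        # Recurse down the most recently added child first (bottom-up zip).
        last = list(node.children)[-1]
        child = node.children[last]
        if child.children:
            self.replace_or_register(child)
        sig = child.signature()
        if sig in self.register:
            # A matching node already exists: merge counts and share it.
            twin = self.register[sig]
            twin.weight += child.weight
            node.children[last] = twin
        else:
            self.register[sig] = child

    def finish(self):
        # Zip up the very last sentence added.
        if self.root.children:
            self.replace_or_register(self.root)

    def nodes(self):
        seen, stack, out = set(), [self.root], []
        while stack:
            n = stack.pop()
            if id(n) in seen:
                continue
            seen.add(id(n))
            out.append(n)
            stack.extend(n.children.values())
        return out

sentences = [
    "#archery by the Republic of Korea and the guy is legally blind .".split(),
    "#archery by the Republic of Korea and the guy who is legally blind .".split(),
    "#archery by the Republic of Korea in archery by a guy who is legally blind .".split(),
]
trie = CondensedTrie()
trie.add_all(sentences)
blind = [n for n in trie.nodes() if n.label == "blind"]
print(len(blind), blind[0].weight)  # 1 3: one shared node, hit by all three
```

Note how the weights come out: the single shared “blind” node is traversed by all three sentences, which is exactly the frequency signal the phrase graph is meant to expose.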

And Voila, we now have all the code needed to make a phrase graph!

### Tweaking the automata to condense more phrases

BUT! Suppose you want something more minimal? Suppose you think it’s kinda funny that the interjection of “who” prevents “guy SOMETHING OR NOTHING is” from being a phrase. Or you try adding in the sentence

#archery ZOIDBERG by the Republic of Korea and the guy who is legally blind.


and notice how it creates an entirely new branch for “the Republic of Korea” starting at “ZOIDBERG”, thus making the number of times you think you’ve seen that phrase dependent on the previous tokens. Can we fix this? YES! All we have to do is relax our definition of finding a matching element in the registry: instead of an exact match, we look for a node whose outgoing links are a superset of the most recently added node’s children.

And since Scala is awesome, we can do this with minimal effort.

All we had to do was update the constructor to take in a boolean, then create a new data member that links to one of two comparison functions for pairs of nodes: 1) an exact matching function, which we use for a true compressed trie, and 2) a subset matching function, to get our even more compressed sorta-trie. If we swap in subsumeMatch, we now get this phrase graph:
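The swap itself is tiny. Here’s the pair of comparison functions sketched in Python rather than the post’s Scala, with names of my own choosing:

```python
class Node:
    """Minimal stand-in for a phrase-graph node."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or {}  # label -> Node

def exact_match(candidate, registered):
    # True compressed-trie test: identical label and identical child links.
    return (candidate.label == registered.label
            and candidate.children == registered.children)

def subsume_match(candidate, registered):
    # Relaxed test: the registered node's outgoing links are a superset of
    # the candidate's, so an interjected token upstream doesn't force an
    # entirely new branch for the same phrase.
    return (candidate.label == registered.label
            and all(registered.children.get(l) is c
                    for l, c in candidate.children.items()))

tail = Node("is")
big = Node("guy", {"who": Node("who"), "is": tail})
small = Node("guy", {"is": tail})
print(exact_match(small, big), subsume_match(small, big))  # False True
```

With exact matching the two “guy” nodes stay separate; with subsume matching the smaller one folds into the bigger one, which is the extra condensing described above.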

Let’s see how the two versions handle this as input:

Republic of Korea in archery by a guy who is legally blind
#archery by the Republic of Korea and by the guy is legally blind
#archery by the Republic of Korea and the guy is legally blind
#archery by the Republic of Korea in archery by a guy who is legally blind
#archery zoidberg by the Republic of Korea and by the guy is legally blind
#archery zoidberg by the Republic of Korea in archery by a guy who is legally blind


Using exact matching:

Finally! This second version is precisely the data structure the original papers were describing.

# Building Visualizations to Test Summarizations

I’m currently working on interesting online methods for summarizing streams of documents. The basic idea is that documents come hurtling into your inbox at a startling rate and you’d like a quick, easy, online method to summarize what they’re about. A lot of approaches to text summarization are offline, meaning that those methods inspect all the documents at once. That’s not practical, especially if you want to, oh say, do this on all the tweets about the ongoing 2012 Olympics in London. So my goal is to work up a good enough algorithm for doing this process completely online. Even though it sadly won’t be working well enough to actually run while the Olympics is going on (I’m still working on said algorithm), it could be pretty cool.

However, figuring out whether you’re doing something right or wrong on many million tweets about 50 different sports is kind of challenging. So while I’m gathering a ton of data to process, and then processing it all, I figured I should design a nice UI for exploring the results. Being a terrible UI guy, I thought I could never pull it off, but thanks to the magicians behind Crossfilter and D3.js, it turned out to be pretty easy. The result of my UI wizardry is currently here. And while there’s quite a lot more to add, such as a way to select other sports or other summarization methods, it does the bulk of what I want:

1. It builds histograms of tweets based on three dimensions: the date, the hour, and the “cluster” of the tweet.
2. It lets you select sub-regions of these dimensions and automatically updates the histograms for the other dimensions. So if you put a range on the day, you can see the histograms according to hour and cluster for that date range.
3. Given a range, you can also see the most representative, or summary, tweets for the most frequent clusters in that range. There’s still a little bit missing; I should really be ordering the summaries by their time, but that’ll come later.

As complicated as all that initially sounds, I barely had to write any JavaScript on my own, which is truly fortunate since I barely know JavaScript.

## The joy of making that UI.

Since I know next to nothing about JavaScript, D3.js, and Crossfilter, I did a lot of hacking, console debugging, and total guessing to make this beast work. So here’s a quick rundown on what these three things are doing together and how they synergize into my current app. There’s still quite a bit I don’t know, so I’ll mostly focus on what I figured out in my hackings.

Crossfilter processes arrays of key-value JavaScript objects, which can be easily pulled out of Comma Separated Value files. However, those initial object arrays are totally untyped, so you need to do some processing to shuffle values out of raw strings into something more usable. I’m currently using two styles of data: 1) one format that simply records the time of a tweet and its cluster identifier and 2) one that records the cluster identifier, the time of the first tweet in that cluster, the time of the average tweet in that cluster, and the summary tweet. They’re pretty simple and look like this:

and

D3 makes this super easy to handle. All you do is call d3.csv(fileName, callback). In my example, this turns out to be:

and

### Crossing the filters on that data

Once you’ve got data loaded, you gotta do something with it, no? Crossfilter lets you do some super powerful things with very little work. The primary job of Crossfilter is to take your array of objects and let you select different dimensions to act as keys into that array. Initially your key is just the index of the array. But after calling dimension on a crossfiltered object, you can select any variable in your objects to be a key. Since I wanted three charts, that means I need three keys: 1) a key on the day, 2) a key on the hour, and 3) a key on the cluster id. I also want counts for the number of tweets in the bins corresponding to each dimension. That sounds like a lot of work, but it’s as easy as this:

That’s it! All you need is two lines to select a dimension for your chart and compute the data for the histogram. Easy breezy.
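Those two lines are Crossfilter JavaScript, but what `dimension(...)` plus `group()` computes is essentially just a keyed count. Here’s the same idea sketched in Python; the field names and records are made-up stand-ins for my CSV rows:

```python
from collections import Counter

# Hypothetical tweet rows standing in for the parsed CSV objects.
tweets = [
    {"day": "2012-08-01", "hour": 9,  "cluster": 3},
    {"day": "2012-08-01", "hour": 10, "cluster": 3},
    {"day": "2012-08-02", "hour": 9,  "cluster": 7},
]

def group_counts(records, key):
    """The essence of dimension(key) + group(): a histogram keyed by `key`."""
    return Counter(r[key] for r in records)

day_counts = group_counts(tweets, "day")
hour_counts = group_counts(tweets, "hour")
cluster_counts = group_counts(tweets, "cluster")
print(day_counts["2012-08-01"], hour_counts[9], cluster_counts[3])  # 2 2 2
```

What Crossfilter adds on top of this is the fast re-computation of those counts whenever a filter on another dimension changes.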

### Charting those groups

Now that you’ve got some dimensions set up and some counts to go along with them, it’s time to plot those fine numbers. For each chart you want, all you have to do is note what dimension you want to use, provide the summary counts, put some limits on the plots, then apply all that to some plotting object like a bar graph. I’m just using bar charts, but this other crossfilter example gives some sweet alternatives you can re-use.

### Printing the tweet summaries

The fun part is printing all the summaries for the tweets that have been selected. The original Crossfilter example was pretty simple: it just printed out the actual rows being selected in the histograms. But I wanted to do something more complicated. I wanted to figure out which clusters existed in the selection, get the summaries attached to each cluster (and only one copy of the summary per cluster), and then organize the summaries by date. Not knowing JavaScript, that sounded kinda hard. In my candy land language, Scala, it’s pretty easy to do with some groupBys and maps, but does JavaScript have all this? YES! Turns out the clusters object computed to print the histogram has nearly everything I want: the collection of cluster identifiers found in the current filter selection. And since all arrays in JavaScript have a map operator, I can get the array of summaries I so desperately desired.

Next comes the cool part, creating a hierarchy on the cluster summaries based on the date. These two lines together do that magic:

The first line creates an object that will nest any array of items with a startTime attribute according to their day, and the second line runs that nester over the cluster summaries to get a mapping from days to arrays of summaries occurring on each day. Using that nested object, you can build a table of tweet summaries for each day by attaching the data object to the list div holding those tables:
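The real code uses d3.nest, but the nesting step itself is just a group-by on the date part of startTime. A rough Python equivalent, with field names and records invented to stand in for my summary CSV format:

```python
from collections import defaultdict

# Hypothetical cluster summaries standing in for the CSV rows.
summaries = [
    {"cluster": 3, "startTime": "2012-08-01T09:15", "summary": "gold for KOR"},
    {"cluster": 7, "startTime": "2012-08-02T11:40", "summary": "a new world record"},
    {"cluster": 9, "startTime": "2012-08-01T18:02", "summary": "upset in round one"},
]

def nest_by_day(items):
    """What d3.nest().key(by day).entries(...) boils down to: a group-by."""
    nested = defaultdict(list)
    for item in items:
        # The day is the leading YYYY-MM-DD chunk of the timestamp.
        nested[item["startTime"][:10]].append(item)
    return dict(nested)

by_day = nest_by_day(summaries)
print(sorted(by_day))             # ['2012-08-01', '2012-08-02']
print(len(by_day["2012-08-01"]))  # 2
```

Each key in that mapping becomes one per-day table in the UI, and each list becomes that table’s rows.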

And that’s it! Again, easy breezy.

### Setting up the divs for the stuff you want

The last thing you need whenever you’re going to be mashing data into a website via D3 is some divs to host that data. For my application, I need just two types of divs: charts to hold the histograms and tables to hold the summaries. These look like:

They’re pretty dead simple. One chart for each histogram I’m plotting and a general div for the lists. The lists will get populated with more divs dynamically based on how many dates fall into a selected range.

## Creating the data to put in this app

So how did I get all these tweets? And how did I split them up into different clusters? That’s a secret for now, but if my current research project is looking good, that’ll become the topic of a new research paper, and if not, it’ll be the topic of a blog post describing what failed! So stay tuned!

# Attending two conferences in Jeju, South Korea

Today is the last day of two co-located top-tier conferences for Natural Language Processing and Computational Linguistics: ACL and EMNLP. There have been six full days of presentations, posters, and many conversations over coffee breaks, lunch, and dinner. Being a lone researcher from two labs attending the conferences, I’ve done a lot of floating and mingling with different groups. Here’s a debrief of some of the most interesting talks and posters I’ve seen, some good conversations I’ve had, and some thoughts on other parts of the conferences.

## Talks and Posters

By a wide margin, the most interesting talks, in my opinion, were those that used clean and well structured (non-parametric) Bayesian models to analyze text or speech. Here’s a short rundown of the ones I liked most.

### A Nonparametric Bayesian Approach to Acoustic Model Discovery

This paper tied together Hidden Markov Models and Dirichlet Process Mixture Models to automatically learn phonemes in spoken data. Every time an utterance is observed, they guess at some word boundaries, and then for each segment within those boundaries, they select an HMM from a Dirichlet Process to model the phoneme. Their performance was pretty good considering it was fully unsupervised, and their model even learned finer distinctions between phonemes based on common speaker patterns that weren’t recorded in the test labels.

### SITS: A Hierarchical Non-parametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversation

This model extends LDA to utilize speaker identities to automatically determine the topics each speaker prefers to use and the likelihood of a speaker changing the topic during a conversation. For example, in a Vice-Presidential debate, they captured Sarah Palin’s tendency to completely change the topic of the conversation when responding to a question.

### Word Sense Disambiguation Improves Information Retrieval

WSD has often been linked to IR with mixed results. This latest attempt first performs WSD by learning word senses from aligned texts in Chinese and English, where word senses are determined by the presence of multiple translations of a word between the two languages and the ways in which it can be translated. Using this information, they were able to improve a standard IR baseline on the TREC dataset.

### Learning to “Read Between the Lines” using Bayesian Logic Programming

This was my first exposure to Bayesian Logic Programming, which provides a nice and more scalable alternative to Markov Logic Networks. Using a pipeline system, they do some information extraction and then use a BLP to infer logical connections that can then be used to infer unsaid facts from observed information. For instance, if we know Barack Obama is President of the U.S., and we know that presidents, or leaders in general, are members of their group, then the BLP should probabilistically infer that Barack Obama is a member of the U.S. even if it’s never said.

### Computational Approaches to Sentence Completion

This work evaluated Language Models and Distributional Models using an SAT-styled sentence completion test. They covered a lot of different approaches to this problem, including two ways of using LSA, a Recurrent Neural Network, and standard N-Gram based Language Models. The LSA models clearly did best, which suggests something should be done to enhance the other models. I’d like to see if the different models made different types of errors; for instance, did one handle world-knowledge based questions better, like “the bike law was _ by congress” where the blank should be “ratified”, and do others handle other types of questions better? If this is the case, then perhaps we can learn these situations and build a model to decide when to use each approach on a test question.

### Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

This model used a really elegant and simple three-level Pitman-Yor process to combine two sophisticated parsing models together: Symbol-Refined trees and Tree Substitution Grammars. The first approach refines each node in syntax trees with finer categories and the second considers subtrees within a full tree. By using the three-level Pitman-Yor model, given a parsed example, the first level process models the full tree, the second level models symbol refinements, and the third models possible tree substitutions. Given the huge state space, they showed a simple but effective blocked sampling algorithm.

### Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

This model uses an unsupervised Sequential Markov Model to represent word segments. They improved upon a prior online approach by adding Rejuvenation to the sequential model, which resamples a subset of prior examples after modeling a new sample. By doing this, the model can fix past mistakes made when there was less knowledge about the dataset. This is mostly cool because I fancy Sequential Monte Carlo methods.

### Unsupervised Relation Discovery with Sense Disambiguation

This paper presented a really interesting use of fully unsupervised Word Sense Induction methods. They used a very cute variant of LDA where documents are a series of extracted dependency relation tuples using the same connecting relation. The “words” were then feature vectors modeling each tuple’s words and the topics observed from the tuple’s source sentence and document. The WSD model showed good improvement over relation clustering, but has some limitations. The topics learned by LDA provide an initial clustering over the relations, then a second stage of HAC groups related topics. Would a fully non-parametric model create a more accurate number of clusters automatically by only generating topics when they are truly separate? Could this also be used to test different WSI approaches and give an indication of where they are lacking or what features are missing?

### Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation

This model utilizes BabelNet with a simple ensemble model to really boost knowledge-based WSD. For a given word in a context, they generate the possible translations in multiple languages and then, for every sense, they compute a graph centrality score over BabelNet using the candidate translations. Each translation serves as an individual WSD model, and these are combined with a simple probabilistic mixture model. Given BabelNet, this looks really promising. It might be good to try using this over NYT with my Topic Model evaluations and use it as a comparison between knowledge-based WSD and fully unsupervised WSI.

### The Appification of the Web and the Renaissance of Conversational User Interfaces

Patrick Pantel gave this amazing plenary talk and covered the coming issues and opportunities presented by the growing trend of users accessing the internet via siloed apps on mobile devices and the desire of users to perform tasks using a conversational approach. He presented a lot of really good issues that need consideration and solving. He also went into depth about a graphical model connecting user queries to user intentions and possible actions that can address those desires. I’m really fascinated by what can be done with a given model once it’s been built. For instance, if you have this kind of model, and the user, who has some model of their general intentions or desired actions, asks “what can I do that is interesting”, can you then project the dataset backwards into a view of possible actions and intentions that the user can navigate through to explore possible things to do? Concretely, if the model knows I like certain types of food at certain prices, can it take that knowledge and organize surrounding restaurants in terms of what I like so I can pick and choose these things easily, especially when I don’t really know what I want initially?

### Learning Constraints for Consistent Timeline Extraction

The motivating example behind this paper was the challenge of extracting semantically consistent facts regarding a single entity. For example, given many mentions about James T. Kirk and many events with him as a participant, can you extract the logically consistent facts, such as that he was born after his father and was a space cadet before being a captain or instructor? They do this with a joint supervised system that extracts facts and validates their consistency.

### Monte Carlo MCMC: Efficient Inference by Approximate Sampling

This paper was really exciting and elegant. When doing a Metropolis-Hastings update in a graphical model, you have to compute the acceptance rate in terms of the factors that change under the update. However, some of these factors can be expensive to compute, and there may be very many of them. To improve efficiency, they simply use a subset of the factors in the probability factorization to compute the acceptance rate, with two possible methods for selecting that subset. First, they consider a random subset of a uniform size; alternatively, they keep drawing more and more samples until the estimated mean reaches 95% confidence, allowing different proposals to use different sized subsets of factors and giving the inference more flexibility. In short: really simple, really smart, and really effective.
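The uniform-subset variant is easy to sketch. Here’s a toy Python version (not from the paper; the name `log_factor_deltas`, holding each factor’s log-ratio change under the proposal, is my own) that estimates the Metropolis-Hastings log-acceptance from a random subset:

```python
import random

def approx_log_acceptance(log_factor_deltas, subset_size, rng):
    """Estimate the MH log-acceptance ratio from a uniform random subset
    of per-factor log-ratio changes, scaling the subset sum up to the
    full factor count."""
    n = len(log_factor_deltas)
    subset = rng.sample(log_factor_deltas, subset_size)
    estimate = sum(subset) * n / subset_size
    return min(0.0, estimate)
```

With `subset_size` equal to the number of factors, this reduces to the exact log-acceptance; smaller subsets trade accuracy for speed, which is the whole point of the paper.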

### Discovering Diverse and Salient Threads in Document Collections

News corpora cover many related articles about a string of events. This approach uses a simple graph analysis algorithm to extract chains of articles that cover a single topic spread across many documents. The main issues seem to be with scale and how to select how many and which threads to extract. For instance, can you first impose a hierarchy of topics over the dataset and then select threads within that decomposition? Is there some number of diverse threads that sufficiently describe the dataset? And how would you learn a lot of different threads?

### Constructing Task-Specific Taxonomies for Document Collection Browsing

This sounds really cool, but I didn’t have a chance to see the presentation and haven’t read it yet.

### Semantic Compositionality through Recursive Matrix-Vector Spaces

Andrew Ng currently seems to have a man crush on neural networks. He and a student show a neat improvement upon Recursive Neural Networks, one which can handle compositionality between phrases that have non-linear properties. Take the phrase “absolutely sad”: it does some funny things composition-wise, and existing RNN models can’t handle it. They go a step beyond by projecting each component phrase’s vector representation through a matrix modeling the other word’s uses. So in the example, sad modifies absolutely and vice versa, then the joint vector for both modified words gets pushed through a standard RNN. It worked out pretty well, but requires training. I’d love to see something here that doesn’t need training.

## Conversations with other people

My talk about Exploring Topic Coherence over many models prompted some really interesting responses.

1. First, we didn’t really vary the feature weighting mechanism before doing LSA and PLSA, so this may be affecting the performance of those models by giving them information that LDA doesn’t have. If we try other schemes, or no scheme, what happens? And does that mean we could improve LDA if we had a way of encoding that information?
2. How do the models compare when using corpora of different sizes? We could use TASA as an example of a smaller corpus, or UKWac and Wikipedia as larger corpora. Does LDA get better representations with less information, or does SVD? Can we then improve either method with those lessons?
3. What about word senses? Can we POS tag and word sense tag the corpus and then see improvement along multiple lines? When doing the coherence evaluation, we would probably just have to use the raw tokens initially, but for the word similarity and document evaluations, we can use the sense-tagged words. Issues with this: How do we do sense tagging? SemCor is reasonably sized but WSD models kinda suck. Are they sufficient to improve results? What about WSI models? If WSI models do something positive, this would be a good evaluation of those models without actually knowing what kind of senses or how many we need to learn.

## The location

I wasn’t such a huge fan of the conferences being located on Jeju island. There were things to do on the island, but they were all expensive, as was the food, in my opinion. Furthermore, the banquet was mildly terrible. We had it in this tiny restaurant next to the ocean, which granted had a nice view, but was too small for the attendees, and it was a buffet, so a hundred computational linguists, who have all heard about Chinese Restaurant Processes, got to wait in line to sample tasty Korean food. As awesomely ironic as it was, it made getting food a terrible experience. I’d have preferred it to be in some major city or college town where things are cheap (note how the 2010 meeting of ACL was amazing). Jeju is nice and all, especially if I had the time and money to explore the island more, but I had neither really, and I think nearly all students felt the same way.

# Gamma Ramma

### Before getting to Cooler Mixture Models

Before going into the depths of ways to improve the simple Multinomial Mixture Model that I discussed in the last post, I want to give a little adventure story regarding the Gamma distribution. Why the Gamma distribution? Mainly because it is deeply tied to Gaussian distributions, and those will factor heavily into cool things you can do in mixture models.

### This thing called my Conjugate Prior

So how is the Gamma distribution related to a Gaussian distribution? The short story is that a properly parameterized Gamma distribution can model the precision (or inverse variance) of a Gaussian distribution (the proper wording for this relationship is that the Gamma distribution is a conjugate prior for a Gaussian distribution when you know the mean but don’t know the precision). This turns out to be super powerful when you have some idea of what the mean of your Gaussian is but need to estimate the variance. However, this distribution has some flavours that can really wreak havoc on your ability to learn these values if you use them in the wrong place, and similarly people are really terrible at pointing out which flavour you can and should use. So here’s an adventure in taste testing the Gamma distribution.

### Sampling from Gamma Distributions

Let’s start with an example. Let’s say we have two one-dimensional Gaussian distributions. Each distribution is defined by two parameters: $\mu$ for the mean, or center point, and $\sigma^2$ for the variance, or inverse precision. So let’s say our two distributions have these two parametrizations: $\mu_1 = 5.0, \sigma^2_1 = 1.0$ and $\mu_2 = 10.0, \sigma^2_2 = 6.0$. Here’s a nice density plot of these two distributions together:
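The density plot isn’t reproduced here, but the two distributions are easy to draw from; a quick Python sketch (the original post used Scala, and note that the standard library’s `random.gauss` expects $\sigma$, not $\sigma^2$):

```python
import random

random.seed(1)

# N(mu=5, sigma^2=1) and N(mu=10, sigma^2=6); gauss takes sigma = sqrt(sigma^2)
g1 = [random.gauss(5.0, 1.0 ** 0.5) for _ in range(100_000)]
g2 = [random.gauss(10.0, 6.0 ** 0.5) for _ in range(100_000)]

def variance(xs):
    """Plain sample variance, to check that we recover sigma^2."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

The sample variances come out near 1.0 and 6.0, the two values we’ll be trying to recover from Gamma samples below.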

Now let’s look at what the Gamma distribution can do if we sample from it with the “proper” parameters (which I’ll point out later on).

It looks like a good approximation of what we know the two variance values to be, and the samples can be generated really easily with Scala’s breeze library, which you’ll see in the next section. But let’s, for funsies’ sake, say that we don’t like using breeze (say because we hate Scala and prefer to trust libraries written in Java); we may then want to use some other implementation of this distribution, like, say, in Apache Commons Math or Java’s Colt library. What happens if we take the parameters we passed into breeze and pass them to these libraries? What comes out? Let’s see!

That looks kinda funky. What’s going on? This exposes the difference between the two main flavours of this distribution: two related but poorly explained ways to parameterize the model! Both flavours depend on a shape parameter that (aptly named) guides the shape of the distribution. The difference between the flavours is the use of either a scale parameter, sometimes referred to as $\theta$, or a rate parameter, sometimes denoted as $\beta$. Is there a relation between these two? Totally! They are inverses of each other, i.e. $\theta = 1/\beta$.
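This mix-up is easy to reproduce once you know which flavour a library expects. For instance, Python’s `random.gammavariate(shape, x)` treats its second argument as a scale, so handing it a rate silently changes the distribution; a small sketch of the same confusion (an illustrative stand-in, not the original Scala/Colt code):

```python
import random

random.seed(0)

shape = 3.0
scale = 2.0          # theta
rate = 1.0 / scale   # beta = 1 / theta

# random.gammavariate(shape, x) treats x as a SCALE, so this is correct:
right = [random.gammavariate(shape, scale) for _ in range(100_000)]
# ...while passing the rate where a scale is expected skews everything:
wrong = [random.gammavariate(shape, rate) for _ in range(100_000)]

mean_right = sum(right) / len(right)  # near shape * scale = 6.0
mean_wrong = sum(wrong) / len(wrong)  # near shape * rate  = 1.5
```

A quick sanity check: the mean of a Gamma is shape × scale, so comparing sample means against $\alpha\theta$ versus $\alpha/\beta$ usually reveals which parameterization you’re dealing with.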

### Finding the funky culprit

Now how do we know which of the above three implementations is doing the right thing? One way is to use the Inverse Gamma distribution. Samples from an Inverse Gamma distribution should (roughly) be the reciprocal of samples from a Gamma distribution with the same parameters. But these packages don’t really have an implementation of an Inverse Gamma distribution, so what else can we do to check? Well, Colt looks to be doing two really weird things that violate our intuitions about variances: (a) one set of variance estimates is completely missing, and (b) the one that does exist gives negative variance values, which we know to be impossible since the variance is defined to be an average of real values raised to the power of two. And if we read the Colt javadoc more carefully, we can figure out that they use the rate and not the scale, but it’s never totally obvious in their documentation. So if we fix that in our parameterization, we get this agreeable set of plots:

That’s much better looking.

### Finding the magic parameters

Now let’s dive deeper and see how I made all these (now fixed) samples:

In this snippet, I’ve left out data, but it’s simply a map for accessing the list of samples I drew earlier to make the Gaussian plot. The key equations here are for computing what I’m calling alpha and beta_*. Looking at a handy table of conjugate priors for continuous likelihood distributions, we find these two core equations used in the code:

Where $n$ is the number of points in the group and $x_i$ denotes points coming from the same distribution. Looking at that table of conjugate priors and the list created in the first for loop, you might wonder why I use the reciprocal of the $\beta$ values, i.e. the scales. This is because of the cute footnote in the conjugate prior table noting that $\beta$ as computed above is in fact the rate.
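For reference, the update in question (for the precision of a Gaussian with known mean $\mu$ and prior $\text{Gamma}(\alpha_0, \beta_0)$) is $\alpha_n = \alpha_0 + n/2$ and $\beta_n = \beta_0 + \frac{1}{2}\sum_{i}(x_i - \mu)^2$, with $\beta_n$ a rate. A minimal Python version (the post’s actual code was Scala; the function name is mine):

```python
def gamma_posterior(xs, mu, alpha0, beta0):
    """Posterior Gamma(alpha, beta) over the precision of a Gaussian with
    known mean mu. The returned beta is a RATE; hand 1/beta to any sampler
    that expects a scale."""
    n = len(xs)
    alpha = alpha0 + n / 2.0
    beta = beta0 + 0.5 * sum((x - mu) ** 2 for x in xs)
    return alpha, beta

alpha, beta = gamma_posterior([1.0, 3.0], mu=2.0, alpha0=1.0, beta0=1.0)
# for this toy data: alpha = 1 + 2/2 = 2.0 and beta = 1 + (1 + 1)/2 = 2.0
```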

### An even more terrifying parametrization

Now if you thought what I’ve touched on is cool, a little funky, and possibly frustrating in its minor differences, it’s time to really complicate things. A really smart man by the name of Carl Edward Rasmussen designed an Infinite Gaussian Mixture Model that depends heavily on the Gamma distribution for just the purpose I’ve described. BUT he gives this parameterization:

Throw this into our sampling code and we get this ridiculous plot:

Which is totally wrong. So what gives? Well, at a much later date, Mr. Rasmussen points out that as he defines them, the two parameters are slightly mutated versions of the shape and scale as we’ve described them so far. The real way to use his parameters is to mutate them by doing this:

### Summary!

So to summarize this adventure time blast, I have one key lesson: figure out which parameterization your software is using and which parameterization a model designer is using and make sure they match. Otherwise you’ll spend a terrible amount of time wondering why you just can’t estimate those variances accurately. I failed to do this a few weeks ago and was quite befuddled for a while, so don’t be like me.

# Topic Model Comparisons: How to Replicate an Experiment

We (Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler) published the paper Exploring Topic Coherence over many models and many topics (link to appear soon) which compares several topic models using a variety of measures in an attempt to determine which model should be used in which application. This evaluation secondly compares automatic coherence measures as a quick, task free method for comparing a variety of models. Below is a detailed series of steps on how to replicate the results from the paper. Note that these instructions are cross posted as part of this GitHub project.

The evaluation setup breaks down into the following steps:

1. Select a corpus and pre-process it.
2. Remove stop words and infrequent words, and format the corpus.
3. Perform topic modelling on all documents.
4. Compute topic coherence measures for induced topics.
5. Compute word similarities using semantic pairing tests.
6. Compute classifier accuracy using induced topics.

Each of these steps is automated in the bash scripts provided in this repository. To run those scripts, read the last section for downloading the needed components, setting parameters, and then watching the scripts blaze through the setup.

The rest of this writeup explains each step in more detail than was permitted in the published paper.

## Selecting the corpus

The evaluation requires the use of a semantically labeled corpus that has a relatively cohesive focus. The original paper used all articles from 2003 of the New York Times Annotated Corpus provided by the Linguistics Data Consortium.
Any similarly structured corpus should work.

The New York Times corpus requires some pre-processing before it can be easily used in the evaluation. The original corpus comes in a series of tarballed xml files where each file looks something like this:

This leaves out a lot of details, but it covers the key items we will need: (1) the full text of the article and (2) all online_sections for the article. Extracting this can be kinda hairy. The following snippet gives a gist of how to extract and format the necessary data:

Before printing the data, we also need to tokenize everything. We used the OpenNLP MaxEnt tokenizers. First download the English MaxEnt tokenizer model here, then do the following before processing each document:

And then do the following to each piece of text extracted:

This should generate one line per document in the format

With properly tokenized text and a series of stop words removed.

## Filtering tokens

In order to limit the memory requirements of our processing steps, we discard any word that is not in the list of word similarity pairs or the top 100k most frequent tokens in the corpus. The following bash lines will accomplish this:
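The original lines aren’t reproduced here, but a minimal sketch of the same filtering (assuming `corpus.txt` holds one whitespace-tokenized document per line; the file names are illustrative) could be:

```shell
# count every token, sort by frequency, and keep the 100k most frequent
tr -s '[:space:]' '\n' < corpus.txt | sed '/^$/d' | sort | uniq -c \
  | sort -rn | head -n 100000 | awk '{print $2}' > top_tokens.txt
```

Per the paragraph above, words from the similarity pairs are also kept, so their vocabulary would be merged into this list as well.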

Once we’ve gotten the top tokens that’ll be used during processing, we do one more filtering of the corpus to reduce each document down to only the accepted words and discard any documents that contain no useful content words. Running FilterCorpus with the top tokens file and the corpus file will return a properly filtered corpus.

## Topic Modeling

With all the pre-processing completed, we can now generate topics for the corpus. We do this using two different methods: (1) Latent Dirichlet Allocation and (2) Latent Semantic Analysis. Unless otherwise stated, we performed topic modeling using each method for 1 to 100 topics, and for 110 to 500 topics in steps of 10.

### Processing for Latent Dirichlet Allocation

We use Mallet’s fast parallel implementation of Latent Dirichlet Allocation to do the topic modelling. Since Mallet’s interface does not let us easily limit the set of tokens or set the indices we want each token to have, we provide a class to do this: TopicModelNewYorkTimes. It takes five arguments:

1. The set of content words to represent.
2. The number of top words to report for each topic.
3. The documents to represent.
4. The number of topics.
5. A name for the output data.

And we run this with the following command.

## Using the automated script

The writeup so far has described the steps we used to compute each experiment in more detail than provided in the original paper. However, to make this even easier to replicate, we’ve provided a run script that automates the process as much as possible. This section describes the minimum number of steps needed to set up the script and do the processing. However, since many of the computations are embarrassingly parallel, we didn’t use this exact script to do our processing. Where noted, we used the same scala class files and inputs, but parallelized the large number of runs using Hadoop Streaming. Since Hadoop Streaming can be highly frustrating and finicky, we leave that parallelization up to you.

Before using the script, you need to download and prepare a few key files that we cannot distribute:

1. The New York Times Annotated Corpus from the Linguistics Data Consortium.
After downloading this, unzip the 2003 portion of the corpus. Then set the nytCorpusDir variable to point to that directory. If you’ve set it up properly, it should have a subdirectory for each month, each of which has subdirectories for each day of that month holding the articles written on that day.
2. Download an OpenNLP Maximum Entropy tokenizer model from here. Set the tokenizerModel variable to the location of this file.
3. Download a stop word file. We used the english-stop-words-large.txt file provided by the S-Space package. Set the stopWords variable to the location of this file.
4. Download the Wackypedia corpus and set the externalCorpusDir variable to its location.

Once those variables have been set and the data has been downloaded, the script should run using the same parameters we used in our experiments. It will eventually produce a large number of word by topic matrices, document by topic matrices, list of top words per topic files, and a series of data files for each experiment that can easily be plotted using R.

If you wish to change some variables, here’s the meaning of each one:

• numTopTokens: The number of words in the corpus to represent (not including words in the semantic similarity judgements).
• numTopWordsPerTopic: The number of top words to report for each topic
• transform: The transformation type used for building the Term Document Matrix used by LSA.
• topicSequence: A sequence of numbers indicating how many topics to learn for each model
• lastTopc: The largest number of topics requested
• exponents: A sequence of numbers indicating the exponent corresponding to each value of epsilon used by the coherence metrics.
• numFolds: The number of stratified folds to compute for the classifier evaluation
• minLabelCount: The minimum number of times each section label should occur
• port: The port number for the coherence server.

All other variables used indicate location and names of files generated by the run script. If space is a concern, set topicDir to a location with a large amount of disk space, most of the generated results will be stored in that location. All other location parameters can be similarly changed as needed without affecting the run script (please notify us by filing an issue if this statement turns out to be wrong).

# Sonatype and Scala: A Relationship of Trickery

Open source projects are amazing. They let you easily share the work you do with others so that they can iterate and improve upon your ideas, an especially invaluable quality when you later move on to other things. Maven, in a very similar fashion, makes sharing open source (or even closed source) Java projects super easy. If you structure your codebase in a Maven way and register an account with Sonatype, you can get your project’s jars distributed worldwide so that anyone and everyone can use your stuff without having to download and compile your code.

For pure Java projects, this process is utterly trivial: just follow some instructions and it all works out beautifully. But Java is verbose, and I often prefer to use Scala, especially for small prototype systems. So how do you deploy a mavenized Scala project with Sonatype? Well, it all pretty much works out, except that you need to create a jar of javadocs before Sonatype will let you publish anything. Not being Java and all, Scala does not do this out of the box, and Maven won’t either. But there’s a cute hack you can do if you’re using Maven 3.0. Just add this snippet to your pom file and watch the deployment rock:
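The pom snippet itself isn’t reproduced above; as a hedged sketch (the coordinates and `doc-jar` goal come from the scala-maven-plugin’s own documentation, but double-check the current version before copying), it looks roughly like:

```xml
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <executions>
    <execution>
      <!-- this id attaches the scaladoc output as javadoc.jar -->
      <id>attach-javadocs</id>
      <goals>
        <goal>doc-jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```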

This uses the versatile scala maven plugin and runs the doc-jar goal to build scala docs every time you type mvn package or some command that depends on mvn package, including mvn deploy. The id bit creates a javadoc.jar containing the scaladoc. So with just that tidbit, you can deploy your maven projects to Sonatype with no issues at all.

# Stepping Into the Mixtures With Gibbs Sampling

### Refresher

As stated before, here and elsewhere, mixture models are pretty cool. But rather than diving head first into a complicated mixture model like the Dirichlet Process Mixture Model or the Hierarchical Dirichlet Process, we’re going to ease our way into these models, just like you’d learn to accept the horrors of a pool of water before you try deep sea diving. First, we’ll figure out how to apply the Gibbs Sampling Method to the Linked List of machine learning methods, Naive Bayes. Then, after Gibbs Sampling feels comfy, we’ll tackle a Finite Mixture Model with Gaussian components. And with the magic of Scalala and Scalanlp, the code’ll be short, sweet, and easy to understand.

### The Toy of Machine Learning Methods

Simple data structures like a Linked List work well as instructive tools. Good libraries provide solid and efficient implementations, but it’s straightforward enough for beginners to implement in a variety of manners. For our goals, Naive Bayes will serve the same purpose. As simple as the model is, it’s reasonably powerful.

So what is the Naive Bayes model? Let’s say we want to group a bundle of emails into two groups: spam and not spam. And all you have to go on are the contents of the emails. In this situation, we’d like to just use the words in a single email to decide whether or not it’s spamtastic or spamfree. Naive Bayes assumes that you can make this decision based on each word in the email individually and then combine these decisions. It’s Bayesian because it’s a simple Bayesian Model and it’s Naive because this approach assumes the words are totally independent and have no statistical relation whatsoever.

### Model Representations

Due to its simplicity, and that naive assumption, Naive Bayes makes for an easy model to understand and describe in a variety of ways. Here are some of the most frequent descriptions:

#### A Bayesian Network

These two views of Naive Bayes describe the same model, but different ways of computing it. For the model on the right, if we work backwards, it says that a data point $X_j$ comes from some multinomial distribution $\theta_{K_j}$. $K_j$ is a latent variable that encodes which group, i.e. spam or not spam, the data point came from, and that itself comes from a multinomial $\phi$. The two distributions $\theta_s, \theta_n$ describe our knowledge about spam and non-spam emails, respectively. And finally, all three multinomial distributions come from Dirichlet distributions. The added magic of the Dirichlet distribution is that it generates multinomial distributions; this relationship is more formally known as a Conjugate Prior. The plate diagram on the left tells nearly the same story, except it collapses out the $\phi$ distribution and links $\alpha$ directly to the component labels. We could also in theory collapse the $\theta$ distributions, but we’ll not do that for the moment.

#### A Graphical Model

This graphical view tells an alternative story. Instead of probability distributions making other distributions and then finally making data points, this says that the class of a data point, $Y$, generates the features of the data point, $w_1, w_2, \cdots, w_v$. This is where the naive part comes from: each “feature”, i.e. word in the email, is completely disconnected from the other words and only depends on the class of the email.

#### A Probability Distribution

Finally, there’s the written way to understand Naive Bayes: as a conditional probability distribution. Let’s write out the distribution for just the spam class:

This looks pretty much the same for non-spam. Simple, no? If we combine the different views above with this written form, then we can think of the class $spam$ as a Multinomial distribution over words, and the probability of a word given the class spam is really just the probability of picking the word based on how many times it’s shown up in spammy emails. Similarly, the probability of seeing spam is just the number of times you’ve seen a delicious piece of spam, be it on your phone, at home, or in the frying pan. Using that intuition and some Bayesian math, the above definition gets boiled down to:

Let’s get some definitions out of the way.

Let’s tease this apart into two pieces, $p(spam)$ and $p(email|spam)$. The first part is:

This basically says that the probability of spam is the number of times we’ve seen it, plus some smoothing factors so that both spam and non-spam have some positive possibility of existing. Amazingly, this is also equivalent to the collapsed $\phi$ distribution we mentioned in the Bayesian description of Naive Bayes: by doing just a little bit of math and realizing that the Dirichlet is a conjugate prior of a multinomial, we get the above distribution for each class.

Next,

If $\theta_s$ describes the probability of each word occurring in a spammy email, then $\theta_{w,s}^{C_{w,e}}$ becomes the likelihood of seeing that word as many times as it was observed in the email. By computing the product over all these words, we naively get the probability of the full email. Slam the two distributions together and we get the full likelihood of the email given the class spam.
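In log space, that boiled-down score is just $\log p(spam) + \sum_w C_{w,e}\,\log \theta_{w,s}$. A tiny Python illustration (the vocabulary and probabilities here are invented for the example):

```python
import math

def log_score(counts, theta, log_prior):
    """log p(class) + sum over words of C_{w,e} * log theta_{w,class}."""
    return log_prior + sum(c * math.log(theta[w]) for w, c in counts.items())

# a two-word vocabulary where spam favours "win" and non-spam favours "hi"
theta_spam = {"win": 0.8, "hi": 0.2}
theta_ham = {"win": 0.1, "hi": 0.9}
email = {"win": 3, "hi": 1}

spam_score = log_score(email, theta_spam, math.log(0.5))
ham_score = log_score(email, theta_ham, math.log(0.5))
# three "win"s outweigh one "hi", so spam_score ends up larger
```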

### Don’t forget Gibbs!

Now that Naive Bayes makes some more sense in a variety of descriptive flavours, and we’ve defined our distributions and hyperparameters, it’s time to throw in Gibbs sampling. To quickly refresh our memories, the recipe for Gibbs at this stage is:

1. For each data point $X_j$:
    1. Forget about $X_j$.
    2. Compute the likelihood that each component, $\theta_*$, generated $X_j$.
    3. Sample and jot down a new component, $K_j$, for $X_j$.
    4. Remember $X_j$.
2. Re-estimate our component parameters, $\theta_{*}$.
3. Repeat.

You might be wondering “why forget about the current data point?” The reason is that we’re trying to decide what to do with a new data point. Gibbs sampling only really works because the distributions we’re working with are exchangeable, i.e. the ordering of our data doesn’t matter. So we can just pretend any data point is new by forgetting it ever existed. Note, this is why $p(spam)$ has a $-1$ in the denominator: that count has been “forgotten”. Other than that tidbit, we just pretend everything else is the same.

### Writing out the code

With Gibbs sampling fresh in our minds, and our probability distributions clear, let’s see what some real code looks like. To do this, I’m going to use Scalala and Scalanlp, which take care of the linear algebra and probability distributions for us really concisely. And while it may not be the fastest library around, it’s super easy to read and understand. For faster implementations, you can just translate this code into whatever beast you desire and optimize away.

#### Setting up before sampling

Before we go on a spam sampling binge, we gotta setup our initial groupings. Gibbs sampling works pretty well with a random starting point, so let’s go with that:

So what’s going on here? Well, first we just get some basic stats about our data, such as the size, the number of features, and the number of components we want, in this case 2. Then we make the data structures that will hold the number of times we see each feature within each component, labelStats; the number of times we’ve seen each component overall, labelCounts; and the parameters for each component, thetas. Finally, we randomly assign each point to a component and update labelStats and labelCounts. Note that we get the thetas from a Dirichlet distribution that uses the labelStats and a smoothing parameter gamma, just like we noted in the Bayesian plate diagram above.

Now for the real sampling meat:
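Since the original Scalala loop isn’t reproduced here, a rough Python/NumPy sketch of the same procedure looks like this (the tiny setup, the seed, and the log-space trick for numerical stability are my own additions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny self-contained setup (stand-in for the initialization code above).
X = rng.integers(0, 5, size=(50, 8)).astype(float)
n, num_features = X.shape
K, gamma = 2, 1.0
labels = rng.integers(0, K, size=n)
label_stats = np.zeros((K, num_features))
label_counts = np.zeros(K)
for j in range(n):
    label_stats[labels[j]] += X[j]
    label_counts[labels[j]] += 1
thetas = np.array([rng.dirichlet(label_stats[k] + gamma) for k in range(K)])

for iteration in range(10):
    for j in range(n):
        # Get x_j's current label, then forget about its data.
        k_old = labels[j]
        label_stats[k_old] -= X[j]
        label_counts[k_old] -= 1

        # prior: p(component) from its frequency, with x_j "forgotten".
        prior = label_counts / (n - 1)
        # likelihood: p(x_j | component) under a multinomial, in log space.
        log_like = X[j] @ np.log(thetas).T
        post = prior * np.exp(log_like - log_like.max())
        post /= post.sum()

        # Sample a new label from this multinomial, then remember x_j.
        k_new = rng.choice(K, p=post)
        labels[j] = k_new
        label_stats[k_new] += X[j]
        label_counts[k_new] += 1

    # After a full sweep, resample each component's parameters.
    thetas = np.array([rng.dirichlet(label_stats[k] + gamma) for k in range(K)])
```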

The first three lines in the loop get the current label for x_j and then forget about its data. prior represents the likelihood of selecting each component based only on its frequency, i.e. $p(spam)$. likelihood represents the likelihood of each component generating x_j, i.e. $p(e|s)$. To finish off the sampling procedure, we sample a new label by turning these likelihoods into a multinomial distribution and sampling from it. With the new label, we then add the data for x_j back into that component’s stats. Once we’ve made a full sweep through the data set, we update our parameters for each component by resampling from a Dirichlet distribution using our labelStats.

And voila! We have a simple Gibbs sampler for Naive Bayes.

### Running it on some data

Let’s see how this unsupervised version of Naive Bayes handles a really generic looking dataset: globs of data points from three different Gaussian distributions, each with a different mean and variance. And let’s start off with just trying to find 2 groups, as we’ve been working along with so far.
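Generating a dataset along these lines is quick; here’s a sketch (the means, spreads, and sizes below are arbitrary choices for illustration, not the exact ones behind the plots):

```python
import numpy as np

rng = np.random.default_rng(1)

# Three 2-D Gaussian blobs, each with a different mean and spread.
means = [(0, 0), (5, 5), (0, 6)]
scales = [0.5, 1.0, 0.8]
data = np.vstack([rng.normal(m, s, size=(50, 2))
                  for m, s in zip(means, scales)])
```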

Looks pretty neat, huh? Even with a terrible and completely random starting point, our learner manages to split up the data in a sensible way by finding a dividing plane between two groups. What’s even cooler: notice how, after iteration 1 completes, nearly all the data is assigned to one class. Yet somehow, by iteration 10, we’ve finagled our way out of that terrible solution and into a significantly better one. And by the 100th iteration, we’ve settled into a stable state.

Now what happens if we try finding 3 components in the data set?

Here we get a similar split, but not what we’re looking for. Our mixture model gets all befuddled by the group on the bottom left and flailingly tries to split it into two smaller groups. If you run this a couple of times, you’ll see it does the same thing with the two green groups: it can’t split them evenly as you’d expect.

### The Secret no one mentions

This simple model I’ve walked through is the unsupervised equivalent of the supervised Naive Bayes model. It does a pretty reasonable job of splitting apart some groups of data, but there are clearly datasets it has issues dealing with. What hasn’t been said so far is that this model is actually not just Naive Bayes: it’s a Finite Mixture Model with Multinomial components. The code, probabilities, and graphs I’ve laid out all work for more than two components. So next, I’ll show two cool ways to extend this model:

1. With Gaussian components, which can better handle this toy data set
2. With an infinite number of components, which figures out that parameter for you like a magician.

# Getting to the Mixtures in Dirichlet Process Mixture Models

The Dirichlet Process is pretty sweet as Edwin Chen and numerous others have pointed out. When combined with a mixture of Bayesian models, it’s really effective for finding an accurate number of models needed to represent your bundle of data, especially useful when you have no idea how many bundles of related words you may have. Edwin Chen does a really great job of explaining how the Dirichlet Process works, how to implement it, and how it can break down the McDonald’s menu, but he leaves out details on how to combine it with mixture models, which is by far the coolest part.

The short description for how to merge the Dirichlet Process with Nonparametric Mixture Modeling (and infer the model) looks like this:

1. Define probability distributions for your mixture model
2. Rewrite your distributions with Bayes rule so that parameters depend on data and hyper parameters
3. Repeat step 2 for hyper parameters as desired (it’s turtles all the way down, as Kevin Knight said)
4. For each data point
1. Forget about the current data point
2. Compute the likelihood that each mixture made this data point
3. Sample a new label for this data point using these likelihoods
4. Jot down this new label
5. Remember this data point
5. After seeing all the data, re-estimate all your parameters based on the new assignments
6. Repeat steps 4 and 5 ad nauseam.
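The label-sampling step (4.2 and 4.3) is where the Dirichlet Process shows up. Under the Chinese Restaurant Process view, an existing component is chosen in proportion to its size times its likelihood, and a brand-new component in proportion to the concentration parameter $\alpha$ times the likelihood under the base distribution. A sketch, with made-up counts and likelihood values:

```python
import numpy as np

rng = np.random.default_rng(2)

def crp_label_probs(counts, alpha, likelihoods, new_likelihood):
    """Label probabilities for the DP step: existing component k gets
    weight n_k * p(x | theta_k); a brand-new component gets weight
    alpha * p(x | base distribution)."""
    weights = np.append(np.array(counts) * np.array(likelihoods),
                        alpha * new_likelihood)
    return weights / weights.sum()

# Two existing components holding 8 and 2 points, concentration alpha = 1.
p = crp_label_probs([8, 2], alpha=1.0,
                    likelihoods=[0.3, 0.1], new_likelihood=0.2)
new_label = rng.choice(len(p), p=p)  # index 2 would mean "open a new component"
```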

In short, this is the Gibbs Sampling approach. If you already know what your distributions look like, and this paltry description made perfect sense, hurray! Please go bake a cake for everyone who feels a little underwhelmed with this description.

Now that someone is off baking us cake, let’s dive into what the distributions look like. Looking at a number of papers describing variations of the Dirichlet Process Mixture Model, like The Infinite Gaussian Mixture Model and Variational Inference for Dirichlet Process Mixture Models, things at first seem pretty straightforward. To start, they generally throw down this simple equation:

Simple, no? If you want mixtures of Gaussian models, then $\theta_j$ has a mean and a variance; we’ll call these $\mu_j$ and $\sigma_j^2$. The next step is to rewrite this equation in a couple of ways so that you define the parameters, i.e. $\theta_j$, in terms of the data and other parameters. As one example, $\mu_j$ looks like this:

Not too bad after reading through the definitions of all the parameters: the first bundle describes the mean of the mixture and the second bundle describes the variance of the mixture. You may have noticed that our means for each mixture come from a Gaussian distribution themselves (I did say it was turtles all the way down). If you’re using a nice math coding environment, like Scalala and Scalanlp, you can easily sample from this distribution like this:
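For readers without a Scalala setup handy, here’s the same idea sketched in Python: the standard conjugate Normal update for a component mean with known precisions. The names `s_j` (data precision), `lam` (prior mean), and `r` (prior precision) are my stand-ins for the paper’s symbols, and the numbers are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_component_mean(x_j, s_j, lam, r, rng):
    """Draw mu_j from its Gaussian posterior: the posterior precision is
    n_j * s_j + r, and the posterior mean is a precision-weighted blend
    of the data sum and the prior mean."""
    n_j = len(x_j)
    post_precision = n_j * s_j + r
    post_mean = (s_j * np.sum(x_j) + r * lam) / post_precision
    return rng.normal(post_mean, np.sqrt(1.0 / post_precision))

# One draw for a component holding three points near 2.0.
mu_j = sample_component_mean(np.array([1.9, 2.1, 2.0]),
                             s_j=4.0, lam=0.0, r=0.1, rng=rng)
```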

This goes on for most of the parameters of the model until you get to one of the core tear-inducing rewrites. This lovely gem describes the probability distribution function for one of the hyperparameters in the Infinite Gaussian Mixture Model:

This is not easy to sample from, especially if you’re new to sophisticated sampling methods. And sadly, just about every academic publication describing these kinds of models gets to a point where the authors assume you know what they’re talking about and throw down Markov Chain Monte Carlo sampling procedures like they’re rice at a wedding.

So what’s a lowly non-statistician data modeller to do? Start with a well-written description of a similar but significantly simpler model. In this case, Gibbs Sampling for the Uninitiated, which describes how to apply Gibbs sampling to the Naive Bayes model. I’ll summarize this awesome paper in the next part of this series and then tweak it a little to make a Finite Gaussian Mixture Model. After that’s said and done, I’ll show how to extend the finite version into the Dirichlet Process Mixture Model (with some simplifications).