# Topic Model Comparisons: How to Replicate an Experiment

We (Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler) published the paper Exploring Topic Coherence over many models and many topics (link to appear soon), which compares several topic models using a variety of measures in an attempt to determine which model should be used in which application. The evaluation also compares automatic coherence measures as a quick, task-free method for comparing a variety of models. Below is a detailed series of steps on how to replicate the results from the paper. Note that these instructions are cross-posted as part of this GitHub project.

The evaluation setup breaks down into the following steps:

1. Select a corpus and pre-process it.
2. Remove stop words and infrequent words, and format the corpus.
3. Perform topic modeling on all documents.
4. Compute topic coherence measures for the induced topics.
5. Compute word similarities using semantic pairing tests.
6. Compute classifier accuracy using the induced topics.
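To make step 4 concrete: the coherence measures score a topic's top words by how often they co-occur. Below is a rough Python sketch of the UMass-style variant (the UCI variant instead uses PMI statistics from an external corpus) with a smoothing term `eps` over document co-occurrence counts; this is an illustration of the general form, not the exact code used in the paper.

```python
import math
from itertools import combinations

def umass_coherence(top_words, doc_freq, co_doc_freq, eps=1.0):
    """Score one topic's top words: sum of log((D(wi, wj) + eps) / D(wj))
    over all word pairs, where D counts (co-)occurrences across documents.
    A sketch of the general form only; see the paper for the exact variants."""
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        co = co_doc_freq.get(frozenset((wi, wj)), 0)
        score += math.log((co + eps) / doc_freq[wj])
    return score
```

Varying `eps` is what the `exponents` parameter of the run script (described below) controls.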

Each of these steps is automated in the bash scripts provided in this repository. To run those scripts, read the last section for downloading the needed components and setting parameters, then watch the scripts blaze through the setup.

The rest of this writeup explains each step in more detail than was permitted in the published paper.

## Selecting the corpus

The evaluation requires the use of a semantically labeled corpus that has a relatively cohesive focus. The original paper used all articles from 2003 of the New York Times Annotated Corpus provided by the Linguistic Data Consortium.
Any similarly structured corpus should work.

The New York Times corpus requires some pre-processing before it can be easily used in the evaluation. The original corpus comes as a series of tarballed XML files. Each file contains many fields, but only two items matter for our purposes: (1) the full text of the article and (2) all online_sections for the article. Extracting these can be somewhat hairy.
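The original extraction snippet is not reproduced here, but the following Python sketch gives the gist of the idea. It assumes the NITF-style layout of the corpus files, where online_sections appears as a `meta` element in the document head and the full text sits in a `block` element with `class="full_text"`; verify these paths against your copy of the corpus.

```python
import xml.etree.ElementTree as ET

def extract_article(xml_string):
    """Pull the online_sections labels and the full article text from one
    New York Times Annotated Corpus file (NITF-style XML, assumed layout)."""
    root = ET.fromstring(xml_string)
    # online_sections is stored as a <meta> element in the document head.
    sections = ""
    for meta in root.iter("meta"):
        if meta.get("name") == "online_sections":
            sections = meta.get("content", "")
    # The article body lives in a <block class="full_text"> element,
    # one paragraph per <p>.
    paragraphs = []
    for block in root.iter("block"):
        if block.get("class") == "full_text":
            paragraphs.extend(p.text or "" for p in block.findall("p"))
    return sections, " ".join(paragraphs)
```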

Before printing the data, we also need to tokenize everything. We used the OpenNLP MaxEnt tokenizer: first download the English MaxEnt tokenizer model, load it once before processing, and then run it over each piece of extracted text. The result is one line per document containing the document's online sections and its properly tokenized text, with stop words removed.
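The original code used OpenNLP's Java API (load the MaxEnt model once, then call its tokenizer on each extracted text). As a stand-in illustration in Python, a crude regex tokenizer plus a stop-word filter produces one line per document in the same spirit; the regex tokenizer and the tab delimiter here are assumptions, not the originals.

```python
import re

def tokenize(text):
    """Crude stand-in for the OpenNLP MaxEnt tokenizer: lowercased word
    tokens, optionally with an apostrophe suffix."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def format_document(sections, text, stop_words):
    """One line per document: section labels, then the filtered tokens.
    The tab delimiter is an assumption, not the paper's exact format."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return sections + "\t" + " ".join(tokens)
```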

## Filtering tokens

To limit the memory requirements of our processing steps, we discard any word that is not in the list of word similarity pairs or among the top 100k most frequent tokens in the corpus. A few bash lines accomplish this.
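The original selection was done with a few bash lines; here is an equivalent Python sketch of the same logic, counting token frequencies, keeping the 100k most frequent, and always keeping the words that appear in the similarity pairs (the cutoff is from the text; the function name and whitespace-split line format are assumptions).

```python
from collections import Counter

def select_vocabulary(corpus_lines, similarity_words, top_k=100_000):
    """Keep the words appearing in the similarity pairs plus the top_k
    most frequent tokens in the corpus; everything else is discarded."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.split())
    top_tokens = {w for w, _ in counts.most_common(top_k)}
    return top_tokens | set(similarity_words)
```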

Once we’ve selected the top tokens to use during processing, we make one more pass over the corpus to reduce each document to only the accepted words and to discard any documents left with no useful content words. Running FilterCorpus with the top-tokens file and the corpus file returns a properly filtered corpus.
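The FilterCorpus class in the repository performs this step; purely as an illustration, the logic amounts to the following sketch (the tab-separated sections/text line format is an assumption, not the class's actual interface).

```python
def filter_corpus(lines, vocabulary):
    """Reduce each document to its accepted words and drop documents left
    with no content words (a sketch of the FilterCorpus step)."""
    for line in lines:
        sections, _, text = line.partition("\t")
        kept = [t for t in text.split() if t in vocabulary]
        if kept:
            yield sections + "\t" + " ".join(kept)
```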

## Topic Modeling

With all the pre-processing completed, we can now generate topics for the corpus. We do this using two different methods: (1) Latent Dirichlet Allocation and (2) Latent Semantic Analysis. Unless otherwise stated, we performed topic modeling with each method for 1 to 100 topics, and from 110 to 500 topics in steps of 10.
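The full sweep of topic counts described above can be generated as:

```python
# Topic counts evaluated in the paper: every value from 1 to 100,
# then 110 through 500 in steps of 10 (140 model sizes in total).
topic_counts = list(range(1, 101)) + list(range(110, 501, 10))
```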

### Processing for Latent Dirichlet Allocation

We use Mallet’s fast parallel implementation of Latent Dirichlet Allocation to do the topic modeling. Since Mallet’s interface does not let us easily limit the set of tokens or set the indices we want each token to have, we provide a class to do this: TopicModelNewYorkTimes. It takes five arguments:

1. The set of content words to represent.
2. The number of top words to report for each topic.
3. The documents to represent.
4. The number of topics.
5. A name for the output data.

We then run this class with the five arguments listed above.
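For intuition about what the Mallet step computes, here is a toy collapsed Gibbs sampler for LDA in Python. This is purely illustrative and is not the code used in the paper; Mallet's parallel sampler is far faster and more careful, and all hyperparameter defaults below are assumptions.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over docs given as lists of
    word indices. Returns topic-word and doc-topic count matrices."""
    rng = random.Random(seed)
    # Randomly assign an initial topic to every token.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    ndk = [[0] * num_topics for _ in docs]               # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove this token's current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + vocab_size * beta)
                           for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return nkw, ndk
```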

## Using the automated script

The writeup so far has described the steps we used to compute each experiment in more detail than the original paper provided. To make this even easier to replicate, we’ve also provided a run script that automates the process as much as possible. This section describes the minimum number of steps needed to set up the script and do the processing. Note that since many of the computations are embarrassingly parallel, we didn’t use this exact script for our own runs: where noted, we used the same scala class files and inputs but parallelized the large number of runs using Hadoop Streaming. Since Hadoop Streaming can be highly frustrating and finicky, we leave that parallelization up to you.

Before using the script, you need to download and prepare a few key files that we cannot distribute:

1. The New York Times Annotated Corpus from the Linguistic Data Consortium.
After downloading this, unzip the 2003 portion of the corpus, then set the nytCorpusDir variable to point to that directory. If you’ve set it up properly, it should have a subdirectory for each month, each of which has subdirectories for each day of that month that hold the articles written on that day.
2. Download an OpenNLP maximum-entropy tokenizer model. Set the tokenizerModel variable to the location of this file.
3. Download a stop word file. We used the english-stop-words-large.txt file provided by the S-Space package. Set the stopWords variable to the location of this file.
4. Download the Wackypedia corpus and set the externalCorpusDir variable to its location.

Once those variables have been set and the data has been downloaded, the script should run using the same parameters we used in our experiments. It will eventually produce a large number of word-by-topic matrices, document-by-topic matrices, lists of top words per topic, and a series of data files for each experiment that can easily be plotted using R.

If you wish to change some variables, here’s the meaning of each one:

• numTopTokens: The number of words in the corpus to represent (not including words in the semantic similarity judgments).
• numTopWordsPerTopic: The number of top words to report for each topic.
• transform: The transformation type used when building the term-document matrix for LSA.
• topicSequence: A sequence of numbers indicating how many topics to learn for each model.
• lastTopc: The largest number of topics requested.
• exponents: A sequence of numbers indicating the exponent corresponding to each value of epsilon used by the coherence metrics.
• numFolds: The number of stratified folds to compute for the classifier evaluation.
• minLabelCount: The minimum number of times each section label must occur.
• port: The port number for the coherence server.

All other variables indicate the locations and names of files generated by the run script. If space is a concern, set topicDir to a location with a large amount of disk space; most of the generated results will be stored there. All other location parameters can be similarly changed as needed without affecting the run script (please notify us by filing an issue if this statement turns out to be wrong).