# Building Visualizations to Test Summarizations

I’m currently working on interesting online methods for summarizing streams of documents. The basic idea is that documents come hurtling into your inbox at a startling rate and you’d like a quick, easy, online method to summarize what they’re about. A lot of approaches to text summarization work offline, meaning that they inspect all the documents before producing a summary. That’s not practical, especially if you want to, oh say, do this thing on all the tweets about the ongoing 2012 Olympics in London. So my goal is to work up a good enough algorithm for doing this process completely online. Even though it sadly won’t be working well enough to actually run online while the Olympics are going on (I’m still working on said algorithm), it could be pretty cool.

However, figuring out if you’re doing something right or wrong on many millions of tweets about 50 different sports is kind of challenging. So while I’m gathering a ton of data to process, and then processing it all, I figured I should design a nice UI for exploring the results. Being a terrible UI guy I thought I could never pull it off, but thanks to the magicians behind Crossfilter and D3.js, it turned out to be pretty easy. The result of my UI wizardry is currently here. And while there’s quite a lot more to add, such as a way to select other sports or other summarization methods, it does the bulk of what I want:

1. It builds histograms of tweets based on three dimensions: the date, the hour, and the “cluster” of the tweet.
2. It lets you select sub-regions of these dimensions and automatically updates the histograms for the other dimensions. So if you put a range on the day, you can see the histograms according to hour and cluster for that date range.
3. Given a range, you can also see the most representative, or summary, tweets for the most frequent clusters in that range. There’s still a little bit missing: I should really be ordering the summaries by their time, but that’ll come later.

As complicated as all that initially sounds, I barely had to write any JavaScript on my own, which is truly fortunate since I barely know JavaScript.

## The joy of making that UI.

Since I know next to nothing about JavaScript, D3.js, and Crossfilter, I did a lot of hacking, console debugging, and total guessing to make this beast work. So here’s a quick rundown on what these three things are doing together and how they synergize into my current app. There’s still quite a bit I don’t know, so I’ll mostly focus on what I figured out in my hackings.

Crossfilter works on arrays of key-value JavaScript objects, which can be easily pulled out of comma-separated files. However, those initial object arrays are totally untyped, so you need to do some processing to shuffle values out of raw strings into something more usable. I’m currently using two styles of data: 1) one format that simply records the time of a tweet and its cluster identifier and 2) one that records the cluster identifier, the time of the first tweet in that cluster, the time of the average tweet in that cluster, and the summary tweet. They’re pretty simple and look like this:

and
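As an illustration only, the two layouts could look something like the sketch below. Apart from `startTime`, the column names (`time`, `cluster`, `meanTime`, `summary`) are assumptions, and every row is invented. The tiny parser is a stand-in for a real CSV reader and ignores quoting; it is only there to show that every value comes out as a string.

```javascript
// Hypothetical sample of format 1: one row per tweet (column names are assumptions).
const tweetCsv = [
  "time,cluster",
  "2012-07-28T14:05:00Z,3",
  "2012-07-28T14:07:30Z,3",
  "2012-07-29T09:15:00Z,7"
].join("\n");

// Hypothetical sample of format 2: one row per cluster.
const summaryCsv = [
  "cluster,startTime,meanTime,summary",
  "3,2012-07-28T14:05:00Z,2012-07-28T14:06:15Z,Example summary tweet for cluster 3",
  "7,2012-07-29T09:15:00Z,2012-07-29T09:15:00Z,Example summary tweet for cluster 7"
].join("\n");

// Naive CSV reader: splits on commas, so it only works when no field contains one.
function parseSimpleCsv(text) {
  const lines = text.trim().split("\n");
  const header = lines[0].split(",");
  return lines.slice(1).map(function (line) {
    const cells = line.split(",");
    const row = {};
    header.forEach(function (name, i) { row[name] = cells[i]; });
    return row;
  });
}

const rawTweets = parseSimpleCsv(tweetCsv);
const rawSummaries = parseSimpleCsv(summaryCsv);

// Every field is still a string at this point, including the cluster id.
console.log(typeof rawTweets[0].cluster);
```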

D3 makes this super easy to handle. All you do is call d3.csv(fileName, callback). In my example, this turns out to be:

and
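A minimal sketch of the kind of typing work such callbacks might do, assuming hypothetical column names (`time`, `cluster`, `meanTime`, `summary`) alongside `startTime`. The function names and the file name in the comment are placeholders, and the day and hour are computed in UTC just to keep the example deterministic. The callback style matches the `d3.csv(fileName, callback)` call described above; recent D3 releases return a promise from `d3.csv` instead.

```javascript
// Convert the untyped string rows of the per-tweet file into typed records.
// The field names `time` and `cluster` are assumptions, not the actual columns.
function typeTweets(rows) {
  return rows.map(function (row) {
    const time = new Date(row.time);
    return {
      time: time,
      // Day bucket: the same instant truncated to midnight UTC.
      day: new Date(Date.UTC(time.getUTCFullYear(), time.getUTCMonth(), time.getUTCDate())),
      hour: time.getUTCHours(),
      cluster: +row.cluster // unary plus turns "3" into 3
    };
  });
}

// Same idea for the per-cluster summary file.
function typeSummaries(rows) {
  return rows.map(function (row) {
    return {
      cluster: +row.cluster,
      startTime: new Date(row.startTime),
      meanTime: new Date(row.meanTime),
      summary: row.summary
    };
  });
}

// Hypothetical wiring with the callback style described above:
//   d3.csv("tweets.csv", function (rows) { var tweets = typeTweets(rows); /* ... */ });

const typedTweets = typeTweets([{ time: "2012-07-28T14:05:00Z", cluster: "3" }]);
console.log(typedTweets[0].hour, typedTweets[0].cluster);
```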

### Crossing the filters on that data

Once you’ve got data loaded, you gotta do something with it, no? Crossfilter lets you do some super powerful things with very little work. The primary job of Crossfilter is to take your array of objects and let you select different dimensions to act as keys in that array. Initially your key is just the index of the array. But after calling `dimension` on a crossfiltered object, you can select any variable in your object to be a key. Since I wanted three charts, that means I need three keys: 1) a key on the day, 2) a key on the hour, and 3) a key on the cluster id. I also want counts for the number of tweets in the bins corresponding to each dimension. That sounds like a lot of work, but it’s as easy as this:

That’s it! All you need is two lines to select a dimension for your chart and compute the data for the histogram. Easy Breezy.
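To make it concrete what a dimension plus its counts amount to, here is a plain-JavaScript sketch that does not use Crossfilter at all. With the library, the two lines per chart are presumably a `dimension(...)` call followed by a `group()` on the result; the exact accessors aren't shown here, so the field names and records below are invented. The helper `countBy` is hypothetical and just mimics the `{key, value}` shape that Crossfilter groups report.

```javascript
// Count records per key, returning [{key, value}, ...] sorted by key.
function countBy(records, keyOf) {
  const counts = new Map();
  records.forEach(function (record) {
    const key = keyOf(record);
    counts.set(key, (counts.get(key) || 0) + 1);
  });
  return Array.from(counts, function (entry) {
    return { key: entry[0], value: entry[1] };
  }).sort(function (a, b) {
    return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
  });
}

// Invented records; `day` is kept as a string to keep the example short.
const records = [
  { day: "2012-07-28", hour: 14, cluster: 3 },
  { day: "2012-07-28", hour: 14, cluster: 3 },
  { day: "2012-07-28", hour: 15, cluster: 7 },
  { day: "2012-07-29", hour: 9, cluster: 7 }
];

// One histogram per "dimension".
const byDay = countBy(records, function (r) { return r.day; });
const byHour = countBy(records, function (r) { return r.hour; });
const byCluster = countBy(records, function (r) { return r.cluster; });

// Cross-filtering: restrict the day, then recount the other dimensions.
const onThe28th = records.filter(function (r) { return r.day === "2012-07-28"; });
const hoursOnThe28th = countBy(onThe28th, function (r) { return r.hour; });

console.log(JSON.stringify(byHour));
console.log(JSON.stringify(hoursOnThe28th));
```

This version rescans the whole array on every selection change, which is fine for a toy example. Crossfilter’s selling point is that it keeps the counts up to date incrementally, which is what makes it usable on large tweet collections.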

### Charting those groups

Now that you’ve got some dimensions set up and some counts to go along with them, it’s time to plot those fine numbers. For each chart you want, all you have to do is note what dimension you want to use, provide the summary counts, put some limits on the plots, then apply all that to some plotting object like a bar graph. I’m just using bar charts, but this other crossfilter example gives some sweet alternatives you can re-use.
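Under the hood, a bar chart mostly boils down to mapping each bin’s count onto a rectangle. The sketch below shows that arithmetic in plain JavaScript; `barLayout` is a hypothetical helper, not part of D3 or Crossfilter, and the bins are invented. In the real app D3 handles the scaling and draws the SVG rectangles.

```javascript
// Turn [{key, value}, ...] bins into rectangles for a bar chart of the given size.
function barLayout(bins, width, height) {
  const maxCount = Math.max.apply(null, bins.map(function (bin) { return bin.value; }));
  const barWidth = width / bins.length;
  return bins.map(function (bin, i) {
    // Linear scale: the tallest bin fills the full chart height.
    const barHeight = maxCount > 0 ? (bin.value / maxCount) * height : 0;
    return {
      key: bin.key,
      x: i * barWidth,
      y: height - barHeight, // SVG's y axis points down, so bars hang from the top offset
      width: barWidth,
      height: barHeight
    };
  });
}

// Invented hour bins in the {key, value} shape used for the histograms.
const hourBins = [{ key: 9, value: 1 }, { key: 14, value: 2 }, { key: 15, value: 1 }];
const bars = barLayout(hourBins, 300, 100);
console.log(bars);
```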

### Printing the tweet summaries

The fun part is printing all the summaries for the tweets that have been selected. The original Crossfilter example was pretty simple: it just printed out the actual rows being selected in the histograms. But I wanted to do something more complicated. I wanted to figure out which clusters existed in the selection, get the summaries attached to each cluster (and only one copy of the summary per cluster), and then organize the summaries by date. Not knowing JavaScript, that sounded kinda hard. In my candy land language, Scala, it’s pretty easy to do with some groupBys and maps, but does JavaScript have all this? YES! Turns out the clusters object computed to print the histogram has nearly everything I want: the collection of cluster identifiers found in the current filter selection. And since all arrays in JavaScript have a map operator, I can get the array of summaries I so desperately desired.
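A rough sketch of that lookup, assuming the cluster counts come as `{key, value}` pairs and that the summaries were loaded into an array of objects with hypothetical `cluster` and `summary` fields. All the data here is invented.

```javascript
// Cluster histogram for the current selection (invented numbers).
const clusterBins = [
  { key: 3, value: 2 },
  { key: 5, value: 0 }, // nothing from cluster 5 falls in the current selection
  { key: 7, value: 1 }
];

// Per-cluster summaries, as loaded from the second data file (invented rows).
const summaries = [
  { cluster: 3, startTime: new Date("2012-07-28T14:05:00Z"), summary: "Summary for cluster 3" },
  { cluster: 5, startTime: new Date("2012-07-28T20:00:00Z"), summary: "Summary for cluster 5" },
  { cluster: 7, startTime: new Date("2012-07-29T09:15:00Z"), summary: "Summary for cluster 7" }
];

// Index the summaries by cluster id so each cluster yields exactly one summary.
const summaryByCluster = {};
summaries.forEach(function (s) { summaryByCluster[s.cluster] = s; });

// Keep only clusters present in the selection, then map each id to its summary.
const selectedSummaries = clusterBins
  .filter(function (bin) { return bin.value > 0; })
  .map(function (bin) { return summaryByCluster[bin.key]; });

console.log(selectedSummaries.map(function (s) { return s.summary; }));
```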

Next comes the cool part, creating a hierarchy on the cluster summaries based on the date. These two lines together do that magic:

The first line creates an object that will nest any array of items with a `startTime` attribute according to their day, and the second line runs that nester over the cluster summaries to get a mapping from days to arrays of summaries occurring on each day. Using that nested object, you can build a table of tweet summaries for each day by attaching the data object to the list div holding those tables:

And that’s it! Again, easy breezy.
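For anyone curious what the nesting step produces, here is a plain-JavaScript sketch of grouping summaries by the day of their `startTime`. D3 releases from this era provide `d3.nest()` with a `.key(...)` accessor for exactly this sort of thing; `nestByDay` below is a hypothetical stand-in that returns the same kind of `{key, values}` entries, uses UTC days for simplicity, and runs on invented data.

```javascript
// Group items by the UTC calendar day of their `startTime`.
function nestByDay(items) {
  const days = {};
  items.forEach(function (item) {
    const dayKey = item.startTime.toISOString().slice(0, 10); // "YYYY-MM-DD"
    (days[dayKey] = days[dayKey] || []).push(item);
  });
  // One entry per day, ordered by date.
  return Object.keys(days).sort().map(function (dayKey) {
    return { key: dayKey, values: days[dayKey] };
  });
}

// Invented cluster summaries, deliberately out of date order.
const clusterSummaries = [
  { cluster: 7, startTime: new Date("2012-07-29T09:15:00Z"), summary: "Summary for cluster 7" },
  { cluster: 3, startTime: new Date("2012-07-28T14:05:00Z"), summary: "Summary for cluster 3" },
  { cluster: 4, startTime: new Date("2012-07-28T18:30:00Z"), summary: "Summary for cluster 4" }
];

const nested = nestByDay(clusterSummaries);
console.log(nested.map(function (d) { return d.key + ": " + d.values.length; }));
```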

### Setting up the divs for the stuff you want

The last thing you need whenever you’re going to be mashing data into a website via D3 is some divs to host that data. For my application, I need just two types of divs: charts to hold the histograms and tables to hold the summaries. These look like:

They’re pretty dead simple. One chart for each histogram I’m plotting and a general div for the lists. The lists will get populated with more divs dynamically based on how many dates fall into a selected range.
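To give a feel for the dynamic part, here is a sketch that turns day-nested summaries into one div per day, each holding a small table. In the real app D3’s data binding does this work; the function names, the `day` class, and the `lists` id in the comment are all made up for the example.

```javascript
// Escape text so tweet content can't break the generated markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build one <div class="day"> per day, each with a table holding one row per summary.
function renderLists(nestedDays) {
  return nestedDays.map(function (day) {
    const rows = day.values.map(function (s) {
      return "<tr><td>" + escapeHtml(s.summary) + "</td></tr>";
    }).join("");
    return '<div class="day"><h3>' + escapeHtml(day.key) + "</h3><table>" + rows + "</table></div>";
  }).join("\n");
}

// Invented input in the {key, values} shape of day-nested summaries.
const daysToShow = [
  { key: "2012-07-28", values: [{ summary: "Gold for <someone>!" }] },
  { key: "2012-07-29", values: [{ summary: "Day two" }, { summary: "Still day two" }] }
];

const listMarkup = renderLists(daysToShow);
console.log(listMarkup);

// In a browser this could be dropped into the general list div, e.g.:
//   document.getElementById("lists").innerHTML = listMarkup; // "lists" is a hypothetical id
```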

## Creating the data to put in this app

So how did I get all these tweets? And how did I split them up into different clusters? That’s a secret for now, but if my current research project is looking good, that’ll become the topic of a new research paper, and if not, it’ll be the topic of a blog post describing what failed! So stay tuned!