I’m currently working on interesting online methods for summarizing streams of documents. The basic idea is that documents come hurtling into your inbox at a startling rate and you’d like a quick, easy, online method to summarize what they’re about. A lot of approaches to text summarization work offline, meaning that they inspect all the documents at once. That’s not practical, especially if you want to, oh say, do this thing on all the tweets about the ongoing 2012 Olympics in London. So my goal is to work up a good enough algorithm for doing this process completely online. Even though it sadly won’t be working well enough to actually run online while the Olympics are going on (I’m still working on said algorithm), it could be pretty cool.
However, figuring out if you’re doing something right or wrong on many million tweets about 50 different sports is kind of challenging. So while I’m gathering a ton of data to process, and then processing it all, I figured I should design a nice UI for exploring the results. Being a terrible UI guy I thought I could never pull it off, but thanks to the magicians behind Crossfilter and D3.js, it turned out to be pretty easy. The result of my UI wizardry is currently here. And while there’s quite a lot more to add, such as a way to select other sports or other summarization methods, it does the bulk of what I want:
- It builds histograms of tweets based on three dimensions: the date, the hour, and the “cluster” of the tweet.
- It lets you select sub-regions of these dimensions and automatically updates the histograms for the other dimensions. So if you put a range on the day, you can see the histograms according to hour and cluster for that date range.
- Given a range, you can also see the most representative, or summary, tweets for the most frequent clusters in that range. There’s still a little bit missing: I should really be ordering the summaries by their time, but that’ll come later.
The joy of making that UI.
Loading that dataset
D3 makes this super easy to handle. All you do is call
d3.csv(fileName, callback). In my example, this turns out to be:
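For a rough idea of what that callback has to deal with: CSV loaders like this hand each row back as an object of strings, so dates and numbers need coercing before you can bin on them. Here’s a minimal, library-free sketch of that step. The column names (time, cluster, summary) and the sample rows are made up for illustration and are not taken from my actual data file.

```javascript
// Sample rows shaped like parsed CSV output: every field is a string.
// Column names here are hypothetical.
const rawRows = [
  { time: "2012-07-29T14:05:00Z", cluster: "3", summary: "swimming final recap" },
  { time: "2012-07-30T09:30:00Z", cluster: "7", summary: "archery upset" },
];

// Coerce one row into typed fields that are convenient to bin on.
function parseTweet(row) {
  const startTime = new Date(row.time);
  return {
    startTime,
    // Midnight (UTC) of the tweet's day, used for the per-day histogram.
    day: new Date(Date.UTC(
      startTime.getUTCFullYear(),
      startTime.getUTCMonth(),
      startTime.getUTCDate()
    )),
    hour: startTime.getUTCHours(),
    cluster: Number(row.cluster),
    summary: row.summary,
  };
}

const tweets = rawRows.map(parseTweet);
console.log(tweets[0].hour, tweets[0].cluster);
```

Doing the coercion once at load time means everything downstream can treat day, hour, and cluster as real values instead of re-parsing strings.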
Crossing the filters on that data
Once you’ve got data loaded, you gotta do something with it, no?
Crossfilter lets you do some super powerful things with very little work.
The primary job of Crossfilter is to take your array of objects and let you
select different dimensions to act as keys in that array. Initially your key is
just the index of the array. But after calling
dimension on a crossfiltered
object, you can select any variable in your object to be a key. Since I wanted
three charts, that means I need three keys: 1) a key on the day, 2) a key on the
hour, and 3) a key on the cluster id. I also want counts for the number of
tweets in the bins corresponding to each dimension. That sounds like a lot of
work, but it’s as easy as this:
That’s it! All you need is two lines to select a dimension for your chart and compute the data for the histogram. Easy Breezy.
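To make the idea concrete without the library, here’s a rough plain-JavaScript sketch of what a dimension plus its counts boils down to: pick a key per record, count the records per key, and recount the other keys whenever one of them is restricted to a range. countBy is a hypothetical helper, not part of Crossfilter, and the records are toy data. The real library does this incrementally and far more efficiently.

```javascript
// Toy records; the field names are illustrative.
const records = [
  { day: "2012-07-28", hour: 9, cluster: 1 },
  { day: "2012-07-28", hour: 9, cluster: 2 },
  { day: "2012-07-29", hour: 14, cluster: 1 },
  { day: "2012-07-30", hour: 14, cluster: 1 },
];

// Hypothetical helper: count records per key and return {key, value}
// pairs sorted by key, which is the shape a histogram wants.
function countBy(items, keyOf) {
  const counts = new Map();
  for (const item of items) {
    const key = keyOf(item);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts]
    .map(([key, value]) => ({ key, value }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Unfiltered histogram over the cluster id.
const byCluster = countBy(records, (d) => d.cluster);

// Put a range on the day, then recount the other two dimensions.
const selected = records.filter(
  (d) => d.day >= "2012-07-29" && d.day <= "2012-07-30"
);
const hoursInRange = countBy(selected, (d) => d.hour);
const clustersInRange = countBy(selected, (d) => d.cluster);

console.log(byCluster, hoursInRange, clustersInRange);
```

The part worth noticing is the last step: a range on one dimension changes the counts reported for the others, which is exactly the linked-histogram behavior the UI relies on.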
Charting those groups
Now that you’ve got some dimensions set up and some counts to go along with them, it’s time to plot those fine numbers. For each chart you want, all you have to do is note what dimension you want to use, provide the summary counts, put some limits on the plots, then apply all that to some plotting object like a bar graph. I’m just using bar charts, but this other crossfilter example gives some sweet alternatives you can re-use.
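If you’re curious what the plotting object has to work out from those inputs, here’s a small sketch of the geometry only, with no D3 involved. linearScale and barLayout are hypothetical helpers, and the 24-hour domain and pixel sizes are arbitrary assumptions chosen for the example.

```javascript
// Counts per hour, in the {key, value} shape used for histograms.
const hourCounts = [
  { key: 9, value: 2 },
  { key: 14, value: 5 },
  { key: 20, value: 1 },
];

// Hypothetical helper: map a value from [d0, d1] onto [r0, r1].
function linearScale([d0, d1], [r0, r1]) {
  return (x) => r0 + ((x - d0) * (r1 - r0)) / (d1 - d0);
}

// Hypothetical helper: turn counts into bar rectangles (SVG-style
// coordinates, where y grows downward from the top of the chart).
function barLayout(groups, { xDomain, width, height }) {
  const x = linearScale(xDomain, [0, width]);
  const maxCount = Math.max(1, ...groups.map((g) => g.value));
  const y = linearScale([0, maxCount], [0, height]);
  const barWidth = width / (xDomain[1] - xDomain[0]);
  return groups.map((g) => ({
    x: x(g.key),
    y: height - y(g.value),
    width: barWidth,
    height: y(g.value),
  }));
}

const bars = barLayout(hourCounts, { xDomain: [0, 24], width: 240, height: 100 });
console.log(bars);
```

The limits on the plot are just the xDomain here; the tallest bar always fills the chart height, so the bars rescale whenever a selection changes the counts.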
Printing the tweet summaries
The fun part is printing all the summaries for the tweets that have been
selected. The original Crossfilter example was pretty simple: it just
printed out the actual rows being selected in the histograms. But I wanted to
do something more complicated. I wanted to figure out which clusters existed in
the selection, get the summaries attached to each cluster (and only one copy
of the summary per cluster), and then organize the summaries by date. In
Scala, it’s pretty easy to do with some groupBys and maps, but does
… The clusters object computed to
print the histogram has nearly everything I want: the collection of cluster
identifiers found in the current filter selection. And since all arrays in …
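As a sketch of the de-duplication step in plain JavaScript: walk the selected rows and keep the first summary seen for each cluster id. The field names and sample rows are assumptions for illustration, and summariesPerCluster is a hypothetical helper rather than anything Crossfilter provides.

```javascript
// Rows in the current selection; many tweets share a cluster, and
// every tweet in a cluster carries that cluster's summary.
const selection = [
  { cluster: 1, summary: "swimming final recap", startTime: "2012-07-31T19:00:00Z" },
  { cluster: 1, summary: "swimming final recap", startTime: "2012-07-31T19:00:00Z" },
  { cluster: 2, summary: "opening ceremony recap", startTime: "2012-07-27T20:00:00Z" },
];

// Hypothetical helper: keep exactly one summary per cluster id.
function summariesPerCluster(rows) {
  const seen = new Map();
  for (const row of rows) {
    if (!seen.has(row.cluster)) {
      seen.set(row.cluster, {
        cluster: row.cluster,
        summary: row.summary,
        startTime: new Date(row.startTime),
      });
    }
  }
  return [...seen.values()];
}

const clusterSummaries = summariesPerCluster(selection);
console.log(clusterSummaries.map((s) => s.summary));
```

A Map keyed on the cluster id keeps the first-seen order of clusters, which is handy if the rows are already sorted the way you want them displayed.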
Next comes the cool part, creating a hierarchy on the cluster summaries based on the date. These two lines together do that magic:
The first line creates an object that will nest any array of items with a
startTime attribute according to their day and the second line runs that
nester over the cluster summaries to get a mapping from days to arrays of
summaries occurring on each day. Using that nested object, you can build a table
of tweet summaries for each day by attaching the data object to the list div
holding those tables:
And that’s it! Again, easy breezy.
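For anyone who wants to see the grouping spelled out, here’s a plain-JavaScript sketch of nesting summaries by day. nestByDay is a hypothetical stand-in for the nesting step: it returns one {key, values} entry per day and assumes each item carries a startTime Date. Days are taken in UTC to keep the example deterministic, and the sample data is made up.

```javascript
// One summary per cluster, each with the time it starts at.
const summaries = [
  { cluster: 1, summary: "swimming final recap", startTime: new Date("2012-07-29T14:05:00Z") },
  { cluster: 2, summary: "opening ceremony recap", startTime: new Date("2012-07-28T09:00:00Z") },
  { cluster: 3, summary: "archery upset", startTime: new Date("2012-07-29T18:30:00Z") },
];

// Hypothetical helper: group items by calendar day (UTC) and return
// [{ key: "YYYY-MM-DD", values: [...] }, ...] sorted by day.
function nestByDay(items) {
  const groups = new Map();
  for (const item of items) {
    const day = item.startTime.toISOString().slice(0, 10);
    if (!groups.has(day)) groups.set(day, []);
    groups.get(day).push(item);
  }
  return [...groups]
    .map(([key, values]) => ({ key, values }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const summariesByDay = nestByDay(summaries);
for (const { key, values } of summariesByDay) {
  console.log(key, values.map((v) => v.summary));
}
```

Each entry maps naturally onto one table: the key becomes the table’s heading and the values become its rows.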
Setting up the divs for the stuff you want
The last thing you need whenever you’re going to be mashing data into a website via D3 is some divs to host that data. For my application, I need just two types of divs: charts to hold the histograms and tables to hold the summaries. These look like:
They’re pretty dead simple. One chart for each histogram I’m plotting and a general div for the lists. The lists will get populated with more divs dynamically based on how many dates fall into a selected range.
Creating the data to put in this app
So how did I get all these tweets? And how did I split them up into different clusters? That’s secret for now, but if my current research project is looking good, that’ll become the topic of a new research paper, and if not, it’ll be the topic of a blog post describing what failed! So stay tuned!