Research papers. I hate them sometimes. They present a great idea, talk about how it can be used and applied, and then give only the barest description of how to actually build and implement the idea, often with no pointers or links to what they built. My current frustration is with building a Phrase Graph. The idea behind phrase graphs are pretty simple, they encode a large set of sentences with a minimal automata.
Here’s a simple example. Say you have the following two sentences:
#archery by the Republic of Korea and the guy is legally blind. #archery by the Republic of Korea and the guy who is legally blind. #archery by the Republic of Korea in archery by a guy who is legally blind.
There’s quite a lot of overlap at the start of the sentence and at the end of the sentence. So a good phrase graph would look something like this:
So you can see that the nodes in the graph represent words found in the sentences observed and if you weight the nodes based on how often they are traversed, you can start to detect which phrases are used most frequently. But how does one do this? And how does one do this efficiently? This is where research papers make me mad. They fail to point out the simplest algorithms for building these awesome ideas. To complement these research ideas, this post’ll give a little more detail on what these phrase graphs are, an easy way to build them using existing libraries, and code to write your own custom phrase graph!
Badly described methods for building a phrase graph
The first paper I read for building phrase graphs, titled Summarizing Sporting events using Twitter, gives this highly detailed algorithm description:
The phrase graph consists of a node for each word appearing in any status update, and an edge between each set of two words that are used adjacently in any status update
Seems easy to implement, no? Here’s a more detailed algorithm, found in Experiments in Microblog Summarization:
To construct the left-hand side, the algorithm starts with the root node. It reduces the set of input sentences to the set of sentences that contain the current node’s phrase. The current node and the root node are initially the same. Since every input sentence is guaranteed to contain the root phrase, our list of sentences does not change initially. Subsequently, the algorithm isolates the set of words that occur immediately before the current node’s phrase. From this set, duplicate words are combined and assigned a count that represents how many instances of those words are detected. For each of these unique words, the algorithm adds them to the graph as nodes with their associated counts to the left of the current node.
This gives a lot more detail on what the phrase graph contains, and an easy enough algorithm, but it’s not exactly a fast algorithm, especially if you want to do this using 10 million tweets about the Olympics. Both descriptions leave out a key detail: these phrase graphs are really just Compressed Tries.
Tries and their compressed cousins
Tries are one of the simplest data structures, and one of the most powerful when processing natural languages. Given a set of words or sentences, a Trie is essentially a standard tree where the leaves represent observed words or sentences. The power of it is that each internal node in the Trie represents overlapping sequences. So if you want to build a Trie for an English Dictionary, the root node would be a blank character, which then points to a node for each letter of the alphabet. From the “a” child, you would then have access to all words starting with “a”, and the further down the Trie you go, you get longer prefixes of words.
Now a Phrase Graph is essentially a Trie which condenses not only shared prefixes, but also any shared subsequence, be they in the middle, or the end. Formally, they are directed acyclic graphs, and if they are treated as a lookup structure, ala a dictionary, they are called Minimal Acyclic Finite-State Automata. And there’s plenty of fast and simple ways to build these things. The easiest places to start reading about these is Incremental Construction of Minimal Acyclic Finite State Automata. The Brics Automaton package also provides a really good implantation for these that works
1 2 3 4 5
The above code snippet will generate this simple minimal automata:
*Note, you may want to open this in a new tab to zoom in as every letter has it’s own state.
Rolling your own automata builder
Using Brics works really well if you just want to check whether or not a sentence matches one seen in a corpus. However, it doesn’t easily let you check how often particular sub-phrases are used within the corpus. For that kinda power, you’ll have to craft your own implemenation. And now it’s time to share the very code to do this!
Nodes in the graph, that’s where.
Where does one start? First, you need a node data structure, with some very carefully crafted functions to determine equality (which indidentally, the research papers don’t point out).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
PhraseNode has three fairly simple data members, a label that records
which word the node represents, the weight of the node, and a map from this node
to it’s children based on their labels. The tricky part of this node is how you
determine equality. Two nodes can be equal in two different senses: 1) they are
the exact same data structure in memory, and so their memory locations are the
same or 2) they have the same label and point to the same exact children in
memory. Checking the first type of equality is easy, you can compare the hash
code of their addresses using the default
hashCode method java provides for
every object. Checking the second form of equality is more challenging to do
efficiently. The naive way would be to recursively check that all children
eventually point to sub-graphs with the same structure. However, checking the
hash code of the pointers of each children is much faster and accomplishes
the same goal. Hence, this is why we override
equals with such
Linking together those Phrase Nodes
The algorithm for linking together
PhraseNodes such that they form a minimal
transducer relies on a few interesting tricks and beautiful recursion. The
first trick we need is lexicographically sorted input. By sorting the input,
you’re maximizing the size of matching prefixes between any neighboring words you want
to put in the transducer. So let’s look at how we do that adding in sorted
Before we get there though, let’s flesh sketch out a
structure. It’s pretty simple. It starts off with just having a single
PhraseNode element, the root.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
Now comes adding entries. Since we want a clean and easy to use interface, we’ll be defensive and assume the elements aren’t sorted, but they are already tokenized, so each element in the given list is a sequence of tokens. How you sort thoes beasts is a homework assignment.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
These methods are mostly simple.
addSuffix starts adding links to nodes
starting from some initial point, note that nodes will automatically create a
new node for a word if one doesn’t already exist.
computeDeepestCommonNodeAndSuffix walks down the Trie starting at the root
consuming each token that has a node and returns the deepest node reachable,
i.e. finds the node with the longest common prefix with a given sequence of
tokens. Finally adding a single tweet depends on getting the prefix, doing some
replaceOrRegister and then adding the suffix to the last node in
the longest prefix. So, only question left, what is this registry business?
The registry keeps track of all nodes in the graph after they’ve been validated.
And what does validation entail? It involves checking wether or not an existing
node already exists in the registry. If one does, you simply replace that
duplicate node with the one in the registry. If no such node exists, in goes
the node. And this is exactly what
replaceOrRegister does. To do this
efficiently and correctly, we call
replaceOrRegister on the last node in our
comment prefix and walk all the way down along the most recently added path,
i.e. the added by the last element we added, and then zip up any matching
nodes which correspond to matching suffixes. By starting at the bottom, we match together end points which have no
children and merge them.
Take our archery example above, all three sentences end with “is legally blind.” After we add the first sentence, there would be a node for each token in the order of the sentence. When we add the second sentence and walk down to the end, we see that “blind.” has a duplicate, which we can merge. Taking one step backwards, we’ll see that “legally” also has an exact match, where two nodes with the same label point to the same exact node, the node we just merged. And then thanks to recursion, we keep zipping things along until we get to “who”, which has no exact match, and we can stop zipping. Walking through an example like this should make the code below a little clearer.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
And Voila, we now have all the code needed to make a phrase graph!
Tweaking the automata to condense more phrases
BUT! Suppose you want something more minimal? Suppose you think it’s kinda funny that interjection of “who” prevents “guy SOMETHING OR NOTHING is” from being a phrase. Or you try adding in the sentence
#archery ZOIDBERG by the Republic of Korea and the guy who is legally blind.
and notice how it creates an entirely new branch for “the Republic of Korea” starting at “ZOIDBERG”, thus making the number of times you think you’ve seen that phrase dependent on the previous tokens. Can we fix this? YES! All we have to do is relax our definition of finding a matching element in the registry to finding a node whose outgoing links are a superset of the most recently added children.
And since Scala is awesome, we can do this with minimal effort.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
All we had to do was update the contractor to take in a boolean, then create a
new data member that links to one of two comparison functions for pairs of
nodes: 1) an exact matching function, which we would use for a true compressed
trie and 2) a subset matching function, to get our even more compressed
sorta-trie. If we swap in
subsumeMatch, we now get this phrase graph:
Let’s see how the two versions handle this as input:
Republic of Korea in archery by a guy who is legally blind #archery by the Republic of Korea and by the guy is legally blind #archery by the Republic of Korea and the guy is legally blind #archery by the Republic of Korea in archery by a guy who is legally blind #archery zoidberg by the Republic of Korea and by the guy is legally blind #archery zoidberg by the Republic of Korea in archery by a guy who is legally blin
Using exact Matching:
Using link subset Matching:
Finally! This second version is precisely the data structure those three original papers were describing.