# Puzzle Time: Transforming Names

This post starts what will hopefully be a new trend: computational solutions to NPR’s Sunday Puzzle. In this weeks puzzle, we have to

Think of a common man’s name in four letters, one syllable. Move each letter exactly halfway around the alphabet. For example, A would become N, N would become A, and B would become O. The result will be a common woman’s name in two syllables. What names are these?

Word puzzles can often be pretty hard for automated computer programs. This is in part what made IBM’s Watson so darn cool. But thankfully this week’s puzzle is amazingly simple for a program, in fact, i’d even say it’s ideal for a program. To solve it, just need two simple steps:

1. Gather a bunch of male and female names
2. Process the male names as noted to see if we have any matching female names

Step 1 is perhaps the hardest since it requires gathering not only names, but common names in english. However, this handy site has done the work for us. Here they have listings of common English names based on a variety of categories: region, ranking, gender, alphabetical, etc. For our purposes, we’ll just grab the a long list of the most frequent male and female names from here and here. All the names are nicely laid out in a table with a pretty regular format. In fact the format is so regular we can write a regular expression for it. Here’s an example of the table:

Regular right? It’s so darn simple we can use some simple command line tools to clean it up:

If we downloaded the female names into female_names_alpha.htm, the first part will simply print out the contents of the html page to standard out. The second part, the grep command, will catch all rows in the table, which thankfully have the same start prefix. Part three extracts the names nestled in a row element. And finally the tail part eliminates the first row of the table, e.g. the head row, which grep accidentally catches. This’ll grab us a ton of names.

So what’s next? Well, we have a bunch of male and female names. But we’ve got too many, namely we have names that clearly won’t work as they have too many or too few letters. So let’s so some more cleaning, but this time in scala:

This dandy line reads the text from the file and filters out any lines that have more than four characters. Since our initial processing placed each name on a line, this in effect removes any names with more than four letters. If we do this with the female names, we’ll have lists of common four letter names for the two genders. For later use, we’ll also turn the female names into a set by calling .toSet on the list.

Last, we can do the transformation on each male name to see if it’s a valid female name and then print out any combinations that match. The next three lines do this slickly:

This works in three steps. First, the map call transforms each male name into a tuple which consists of the original male name, and then a potential female name by sliding up each letter in the name by 13 letters. With ascii characters, this is done just by first grounding the letter to 0, by subtracting the char value for A, adding 13, and then rolling over the value to be back in the range of the 26 letter characters by taking the modulo (%) and finally bumping the character value into the range of real ascii characters by adding A. That’s all done in the first line. Step two is to simply throw out any generated female names that don’t show up in our list of female names, done by line 3. And the final line just prints out any matches we get. With our name lists, this turns out to be:

The rules of the game stipulated that the male name had to have one syllable and the female name has to have two. We didn’t code for this at all since that parts slightly more complicated, and since our pre-processing reduced the set of options down to five, we can easily pick out a valid answer. In this case, “Glen” and “Tyra” or “Ryan” and “Elna” look like valid responses. I’d vote for the first pair since I haven’t heard “Elna” in forever.

# Making Agglomerative Clustering Run in N^2 Time

Let’s say you’re an amateur zoologist and you’ve got a bunch of data describing parliaments of owls, gaggles of geese, peeps of chickens, and a train of jackdaws but you don’t really know that you have these bird types. All you really have are descriptive features describing each bird like feature type, beak type, conservation status, feeding preferences, etc. Using just this data, you’d like to find out how many bird species you have and how similar each group is to the others in a nice graphical fashion like down below. How would you do it?

The classic solution to the problem would be to use Hierarchical Agglomerative Clustering. In this case, Agglomerative Clustering would group together birds that have similar features into hopefully distinct species of birds. And there’s a lot of packages out there that kind of do this for you like Weka, Scikit-Learn, or plain old R. However, you’ve got a lot of birds to deal with and these standard packages are just taking way too long. Why are they taking way too long? Because they’re doing agglomerative clustering the slow ways.

So what do they slow ways look like? Well, pretty much all ways of doing agglomerative clustering start by building an Affinity Matrix that simply measures how similar two birds are to each other:

This little snippet just compares each bird against all other birds and stores how “similar” they are to each other in terms of their descriptive features. What next? Well, you need another data structure to keep track of your bird groups. We’ll do this with just a map from group identifiers to sets of bird identifiers. And since this algorithm is named “agglomerative”, we gotta agglomerate things, so we’ll start by putting every bird in their own bird group:

So that was simple. Now comes the complicated bits that either lead you to a slow version, a fairly slow version, or a super slow version of agglomerative clustering. Before we do either of these options, let’s make two simplifications: let’s assume we just want the final groups that our bird show up in and we know how many bird groups we want to find. Agglomerative Clustering can give you not only this information, but a whole tree showing how birds are linked together, but the book-keeping for doing this is tricky and doesn’t really affect the issues we’re focusing on here. With that out of the way, what does the super slow method look like?

That’s it! It’s super simple and super slow. Why is it so slow? Well, in general, you’ll do the big while loop O(N) times, where N is your number of birds. Then the next block of code compares each possible pairing of bird groups. Since in general the number of groups is proportional to the number of birds, this will be O(N^2) comparisons. Throw these two bits together and you get a runtime of O(N3)! When you have 100,000 birds, that’s super slow.

So what can you do to hasten that up? Well, the main goal of the big loop through pairs of groups is to find the most similar pair of bird groups, so an obvious choice would be to create a priority queue, so let’s see how that looks:

Now this approach is definitely faster but it’s also horribly memory inefficient. Even if you have an array based priority queue that doesn’t allocate any extra memory for each entry other than the tuple being stored, this approach has one major problem: after every merge step O(N) elements immediately become invalidated. Since the two merged groups no longer exist, any comparison between them and other groups is moot. However the priority queue doesn’t easily permit removing them so they just float around. This is precisely why lines 14-16 are in a do while loop that ensures the returned groups exist. It’s highly likely that they wont. I think we can do better, both in terms of speed and in terms of memory. So let’s blaze through a faster method in couple of steps.

First, we need to change the setup of the algorithm. The previous two method just depended on some simple data structures. This approach requires two additional structures for bookkeeping: a way to track chains of nearest neighbors and a way to track the set of clusters not already a part of the chain.

The clusterMap structure replaces our birdMap but does the same thing, but the remaining set and the chain stack hold the crux of this approach. Instead of trying to find the best link between two clusters, we’re going to depend on chains of nearest neighbors, so the first node in the chain could be anything, but the second node is simply the nearest neighbor to the first node. To add the third node, we find the nearest neighbor out of any nodes not in the chain. We’ll keep doing this until two nodes in the chain represent reciprocal nearest neighbors, that is two nodes that are most similar to each other, and no other nodes in or outside of the chain. Upon finding these two nodes, we merge them, immediately. Then we just repeat the process until we have the desired number of clusters. In scala, this turns out to be pretty simple to do:

And that’s it! By focusing on reciprocal nearest neighbors, the algorithm merges together clusters that will always be merged, no matter how you find them. Furthermore, it’s remarkably easy and fast to find these nodes. By building the chain, the number of things that can go on the chain gets smaller and smaller.

Oh, but there’s one other magic trick to making this super fast, and it depends on how you compute updateSimilarity. The silly way to do it would be to traverse all the pairings between nodes in the new cluster and the each other remaining cluster, but that in itself gets really slow as the clusters grow. But rather than doing that, these folks suggest some recurrence relations that can be computed in constant time, for any of the existing agglomerative criteria methods. Crazy right? But just crazy enough to work correctly and be too fast to believe.

# Moral Dilemmas

What is the nature of a moral dilemma? How moral can any dilemma be? Are some dilemmas inherently moral free or can any problem be cast into some level of a moral quandary? And if that’s true, how do you get to this quandary state and where do you go from there?

These questions have been bubbling around in my mind for the past couple months, ever since I read a New York Times Op-Ed post about moral dilemmas and a new book called “Lost in Transition”. In the NYTimes post, the David Brooks notes that the authors of “Lost in Transition” found that modern high school students are unequipped to think about things in moral terms. For instance, The authors posed some questions to students and said the following:

Not many of them have previously given much or any thought to many of the kinds of questions about morality that we asked.

I haven’t read the book in detail, so I can’t really comment on their full findings, proposals, or methodology, so I’m not about to comment on that now. Instead, I got to asking myself the many questions up above. In particular, I wanted to know how I could pose daily activities as moral dilemmas. I don’t often put things into directly moral terms, but I was a Teaching Assistant for a class focused on Ethics in Engineering, so I should be equipped to do some pretty solid moral analysis if I apply myself.

So, I’d like to try and break down some seemingly non-moral problems in totally moral terms. And to start, let’s go with Doing the dishes.

Now, normally, you might look at the task of dishes and think that it’s just something you have to do. Often it’s a little boring, or even a little unpleasant based on your cooking forays, but it’s not too often that people feel morally conflicted about when and whether or not they do their dishes. But if you live with other people and share a common cooking space, I would argue that it is inherently a moral issue. Namely, it’s an issue about acting with empathy.

According to Wikipedia, the world’s coolest encyclopedia, empathy is defined as

the capacity to recognize and, to some extent, share feelings (such as
sadness or happiness) that are being experienced by another sentient or
semi-sentient being

So how does empathy relate at all to doing the dishes? Well, by living in an apartment shared with other people, your actions implicitly affect the livelihood and wellbeing of others. So let’s break down the task of dishes into two aspects: your choice of actions and the possible affects it can have on others. When it comes to doing dishes, there’s really two choices you can make:

1. do the dishes
2. don’t do the dishes

And everyone in the house will likely have one of two reactions to dirty dishes:

1. dirty dishes are not a problem
2. dirty dishes are frustrating

With regards to clean dishes, it’s fairly safe to assume that everyone enjoys and appreciates clean dishes, i’ve certainly not met anyone bothered by a clean plate or glass. So how do these reactions connect to whether or not you should make those dishes sparkly clean? Well, by leaving dirty dishes you’re doing one of two things:

1. you’re assuming that everyone is ok with dirty dishes

I would argue that both cases are a failure in empathy. By assuming that others share your same reaction to doing the dishes, you’re failing to recognize that others in the house may feel otherwise (first half of our empathy definition). If for instance, a roommate has a horrible case of OCD and emotionally breakdown upon the sight of food encrusted dishes, failing to do the dishes may make this particular person deeply wounded. And by simply ignoring how other’s feel about dishes (the second, sharing, half of empathy), even if you recognize that they may feel differently, you’re actively choosing not to share those feelings.

What happens if you do clean the dishes and leave things sparkly, fresh, and lemon scented? You’re still making an assumption about how other’s feel about dishes, but you’re making the most conservative assumption you can; most people prefer clean dishes so by cleaning you’re less likely to cause any emotional breakdowns, or in more normal cases, mild frustration. Assuming that you like to be happy, doing the dishes is the only empathy filled action available, it makes the fewest and most conservative assumptions about what makes others happy.

So that’s one way to break down the task of dishes into a moral dilemma. So the next time you make a mess in the kitchen and ask yourself “should I clean those?” You can make a handy debate about it.

# Morphological Analysis Made Easy With Scala

Ever since I got involved with distributional semantics, i’ve been perplexed about how to handle morphed words, which happens to be just about every noun and verb in English. What is a morphed word in English you ask? It’s pretty much any word that’s been changed to reflect things like past tense, plurality, ownership, and all those things. They’re conjugated verbs and more! But they always pose a massive problem in distributional semantics.

Think about it for a second, what’s the big difference between “cat” and “cats”? Not too much, the second is simply saying there are multiple occurrences of a “cat”. But what do people typically do for distributional semantics? One of two bad options: leave the two words as separate things and thus split the information gained about “cat” across two words or stem every word and throw away something important like the multitude of “cat”. Both seem totally wrong and unsatisfactory.

So on and off I’ve searched for a good tool to do morphological analysis for English. Ideally, the analyzer could recognize the word “cats” and split into two things: cat and <p> that way you can retain all your information about a cat and still know that there were a multitude of them when you see “cats”. And for verbs? You should see the same thing, ran could then become something like run and <past>” so that again, you still know everything about running and that it happened in the past. But until today, I’ve never found a tool that does this both quickly and easily in a usable language.

That all changes today. Today I found the Lltool-box for Java. It creates a large finite state machine for recognizing morphed words and figuring out the root word, i.e. “run” in “ran” and “cat” in “cats”, and how the word was morphed, i.e. “run” for “ran” and “cat” for “cats”. All you have to do is get a listing of how words are morphed, load it up until the Lttoolbox-java system and analyze away your sentences.

So let’s take a spin on how to do this. First, you read the wiki instructions for Lttoolbox-java, download it, and compile it. Compiling is easy:

The -DskipTests part seems needed since their unit tests don’t pass. But after that, you can start using the code in your favorite jvm based language. my personal fave is Scala, so let’s run with that. So what next? Now you create a Finite State Transducer using a dictionary file and tell it to analyze words:

Now that you have a transducer ready to analyze words, you just feed it stuff like this:

You gotta create a reader and writer for their transducer interface. It’s funky, but it still leaves you with a pretty flexible interface. So now what? What does the output look like? If you feed it “cats, dogs and blubber all running quickly!”, the output is pretty ugly at first:

Beastly no? So i figure that it’s best to write some regular expressions to handle all of this. You’ll need a couple: one to split up words, one to match analysed words, one for unrecognized words, and one to split up the tags. In scala you can do this pretty easily like this:

Then all you need to do is run through the analyzed sentence and split it up into separate tokens, some for root words and some for morphed features. You can do that like this:

This bit of code’s pretty sweet. You first iterate over each analyzed word with the first matcher. Then you match each word with the two word level regular expressions: one for fully analyzed words and one for unrecognized words. After that it’s easy smeesy, you just split the tags up with the last regular expression and turn it all into a list. The last two bits at the end turn the whole thing into a single list and lets you filter out tags or tokens you don’t want.

So with that, you can now do simple and fast morphological analysis in Java, Scala, or even Clojure (but who’d do something silly like that?)!-