The exciting world of data compression
I’m back at Brown and computer science is in the air (computer science is uncomfortably humid). My own research isn’t quite ready for a blog entry, but today was a red letter day for Hutter Prize recipient Alexander Ratushnyak. What is this “Hutter Prize” you ask? Well I’ll give you a hint, it involves Wikipedia and data compression. If you want to stop reading right here, I completely understand.
Ok, so suppose you wanted to store the Wikipedia entry for Beverly Hills Cop (guess what was on TV as I was writing this!). It begins:
Beverly Hills Cop (1984) is an Academy Award nominated American comedy film directed by Martin Brest and starring Eddie Murphy. The film shot Murphy to international stardom…
Certainly you could write down the entire entry word for word, but what if you were pressed for space? One thing to do is replace long words with abbreviations, or long phrases with short ones (e.g. “Oscars” for “Academy Awards”). You could also eliminate other redundant information. “1984″ could become “‘84″, since no movies were around in 1884.
If you just want to store is one entry, abbreviations are the way to go. If you look at thousands of entries, however, they’ll have many facts in common. To store all of Wikipedia as efficiently as possible, you’ll need to look at the content of the articles and start to identify the repeated information. The Hutter Prize asks you to do just that. Write a computer program that stores Wikipedia using as little space as possible. If you produce a 10 percent improvement over the current best program, you get 10 percent of 50,000 euros.
The Hutter Prize is definitely not a serious computer science award. It’s basically just a contest a few guys with PhD’s put on the web. Still, I think it’s an interesting idea. Why is this? Well, Wikipedia is filled with information about the real world, so a short description of Wikipedia is a short description of things in the world. By extension, the better a program is at storing Wikipedia, the more it understands about the world.
Storage space may seem like an odd way to measure understanding, but it’s really just a modern take on Occam’s Razor. Occam’s Razor states that: “All things being equal, the simplest explanation tends to be best.” The original Latin version went more like: “Entities should not be multiplied beyond necessity.” So a simple (or short) explanation of Wikipedia is more useful than a long convoluted one. So there you have it, some computer science with roots in 14th century England.