Thursday, February 28, 2008

Netflix Prize: Wired Ariticle just published

On February 25, an article was published in Wired Magazine about the contestants of the Netflix Prize. It focuses on the BellKor team which won the Progress Prize and a psychologist, Gavin Potter, is apparently doing very well under the team name "Just a Guy in the Garage".

In my next blog, I will give the high level view of the BellKor algorithm.

References

Wednesday, February 27, 2008

Netflix Prize: The $1 Million Challenge

Participation in the NetFlix Challenge is simple. First, a user registers a team at NetFlixPrize.com. A team can consist of one or more persons. You go here to register.

Once registered, Netflix sends you the following data files:
  • a movie data set of over 17000 movies with three fields: (title, year, movie id).
  • a training data set that includes over 100 million ratings of 480,000 distinct users with four fields: (user id, movie id, rating, date)
  • a qualifying data set of 2.8 million entries with three fields: (user id, movie id, date) but no rating.
All data sets are presented as flat files. In the case of the training data set, there is a distinct flat file given for each movie.

Movie id consists of a number between 1 and the number of movies (over 17000). User id consists of a number between 1 and 3 million (over 480,000 distinct user ids). Rating is an integer between 1 and 5 where 1 is the lowest rating (worst movie) and 5 is the highest rating (best movie).

The task is to generate ratings for the 2.8 million entries of the qualifying data set based on the training data files. The generated ratings do not have to be integers so you can estimate 1.1, 0.5, 2.5, 5.5, etc for each record in the qualifying data set.

Each submission of generated ratings for the qualifying data set is then scored against the actual ratings (which are kept secret by NetFlix) using the root mean squared error (RMSE). If pi is the prediction and ri is the actual rating, then the RMSE score for a submitted qualifying data set is:



NetFlix provides a tool written in perl for calculating the RMSE. Of course, it is pretty easy to write your own tool since the above formula is equivalent to:

double calcRMSE(double[] p, double[] r) {

int size = p.length;
double sum = 0;

for (int i=0; i < size; i++) {
double diff = p[i] - r[i];
sum += diff*diff;
}

return Math.sqrt(sum/size);

}

Each team can make a submission of these predicted ratings once every 24 hours. Later, they will receive their RMSE score from NetFlix.

The top scores are posted on the web site on the Leaderboard. Interestingly, many of the teams on the leaderboard have hyperlinks to their web pages so if you are interested, you can browse their papers and their home pages.

To win the $1 million, a team needs to have a score lower than 0.8572. If more than one team has a score lower than 0.8572, then the team with the lowest score will win. If after one year, no team has a score lower than 0.8572, then the top score will win a progress prize of $50,000.

To claim any money, a team must submit their algorithm within one week of notification of their score. This algorithm must be wholly original or at least not involve any license requirements from third-party software sources. The full details for this contest can be found here.

At the time that I am writing this blog, the top RMSE score for a qualifying data set is: 0.8684.

So far, only one progress prize has been awarded and this was awarded to a team with the score of 0.8712. This team is called Korbell and here's their home page.

In my next blog, I will start reviewing the details of the algorithm that got to 0.8712. The team that produced that score is also the team that currently has the lowest score at 0.8684.

References

Friday, February 22, 2008

The $50,000 algorithm

Before jumping into the great work by Donald Knuth, I thought it would be interesting to start with analysis of a contemporary algorithm which has already given its authors $50,000.

NetFlix currently has a $1 million contest to anyone who can write an algorithm for predicting user ratings that beats their algorithm by at least 10%. This content has run since 2006 and so far, no one has done it.

Each year, if no one has claimed the top prize, NetFlix awards a $50,000 progress prize to the best algorithm to date. The first such prize was given this year. In order to claim the prize, the individual or team that wrote the algorithm must publish its details.

This year's prize went to Robert M. Bell, Yehuda Koren, and Chris Volinsky, members of the AT&T Research Labs. Considering that there are over 9,000 teams competing for the prize, AT&T can be very proud of their achievement.

As a starter to this blog, I thought it would be interesting to go in detail through the algorithm that won the 2007 Progress Prize.

References

Wednesday, February 20, 2008

Purpose of this blog

Last year, I was fortunate enough to hear the Mark Zuckerberg, the CEO of Facebook, speak at a conference that was organized by Paul Graham. One of the many points that he made was that for every position in his company, he wanted to hire a software engineer. The reasoning for this is in the internet marketplace, nimbleness is everything.

Mark explained that if everyone is a software engineer, then customer service can be more responsive, marketing will understand the technology being promoted, and product will understand deeply what is possible and what is not.

But what does this really mean? What is a software engineer?

Once upon a time, a software engineer was someone who coded. It was someone who you gave programming problems to, they created code, and they showed what they could do. Sometimes, there were math problems. All that has changed. Now, everything comes down to expert knowledge of algorithms and data structures. If you don't speak fluent O-notation, you may have trouble getting your next job at the technology companies in the forefront.

From my view, this blog is really an excuse to dive deep into algorithm and data structure analysis. I had previously started a math blog which to my surprise has been very well received.

I think that it makes sense for me to start a new blog focused on the mathematics of software engineering. Software engineering is rapidly changing. I don't know if in the technology company, everyone will be a software engineer. I do know that as time goes on, the professionalism and algorithmic expertise required for the engineering job will be raised to higher and higher levels.

As I did in my math blog, I will use this blog mostly as a reference for the classic works in software engineering. I will cover selections from the following works:


For more recent and interesting progress in computer science, I will also cover the following works:

In between, I hope to cover important snapshots from the current work of technology including:
  • Javascript/Ajax
  • Ruby/Rails
  • Python
  • Java/Spring/Open Source Projects
  • C#/.Net
At all times, feel free to post your comments and questions.

-Larry