Wednesday, February 27, 2008

Netflix Prize: The $1 Million Challenge

Participation in the NetFlix Challenge is simple. First, a user registers a team at NetFlixPrize.com. A team can consist of one or more persons. You go here to register.

Once registered, Netflix sends you the following data files:
  • a movie data set of over 17000 movies with three fields: (title, year, movie id).
  • a training data set that includes over 100 million ratings of 480,000 distinct users with four fields: (user id, movie id, rating, date)
  • a qualifying data set of 2.8 million entries with three fields: (user id, movie id, date) but no rating.
All data sets are presented as flat files. In the case of the training data set, there is a distinct flat file given for each movie.

Movie id consists of a number between 1 and the number of movies (over 17000). User id consists of a number between 1 and 3 million (over 480,000 distinct user ids). Rating is an integer between 1 and 5 where 1 is the lowest rating (worst movie) and 5 is the highest rating (best movie).

The task is to generate ratings for the 2.8 million entries of the qualifying data set based on the training data files. The generated ratings do not have to be integers so you can estimate 1.1, 0.5, 2.5, 5.5, etc for each record in the qualifying data set.

Each submission of generated ratings for the qualifying data set is then scored against the actual ratings (which are kept secret by NetFlix) using the root mean squared error (RMSE). If pi is the prediction and ri is the actual rating, then the RMSE score for a submitted qualifying data set is:



NetFlix provides a tool written in perl for calculating the RMSE. Of course, it is pretty easy to write your own tool since the above formula is equivalent to:

double calcRMSE(double[] p, double[] r) {

int size = p.length;
double sum = 0;

for (int i=0; i < size; i++) {
double diff = p[i] - r[i];
sum += diff*diff;
}

return Math.sqrt(sum/size);

}

Each team can make a submission of these predicted ratings once every 24 hours. Later, they will receive their RMSE score from NetFlix.

The top scores are posted on the web site on the Leaderboard. Interestingly, many of the teams on the leaderboard have hyperlinks to their web pages so if you are interested, you can browse their papers and their home pages.

To win the $1 million, a team needs to have a score lower than 0.8572. If more than one team has a score lower than 0.8572, then the team with the lowest score will win. If after one year, no team has a score lower than 0.8572, then the top score will win a progress prize of $50,000.

To claim any money, a team must submit their algorithm within one week of notification of their score. This algorithm must be wholly original or at least not involve any license requirements from third-party software sources. The full details for this contest can be found here.

At the time that I am writing this blog, the top RMSE score for a qualifying data set is: 0.8684.

So far, only one progress prize has been awarded and this was awarded to a team with the score of 0.8712. This team is called Korbell and here's their home page.

In my next blog, I will start reviewing the details of the algorithm that got to 0.8712. The team that produced that score is also the team that currently has the lowest score at 0.8684.

References

No comments: