Tuesday, July 22, 2008

BellKor Algorithm: Pearson Correlation Coefficient

Introduction

In their paper on the k-nearest neighbor (kNN) algorithm, the BellKor team mentions three choices that they considered for a similarity function: the Pearson Correlation Coefficient, the Cosine Coefficient, and the Root Mean Square Difference.

In today's blog, I will focus solely on the Pearson Correlation Coefficient. It is clear from the paper that the BellKor team did a lot of work with Pearson. There is also a good explanation of it in the O'Reilly book Programming Collective Intelligence.

In line with the BellKor article, I will focus on the item-item correlation. Newman! has reported generating this statistic in 20 minutes on a single processor, and Dan Tillberg has reported generating it in about 3 minutes on a quad-core processor. The algorithm that I describe in today's blog took 2 hours to run and required 1GB of RAM on my machine; it took another 45 minutes to load the results into MySQL. I wrote the algorithm in Java. I suspect that a C/C++ version could go faster, and my code is not well optimized. My goal here is to present the algorithm in the clearest possible way.

MySQL Tables

It is assumed that a Rating table already exists in mysql that includes all the Netflix training data. In a previous blog, I went into detail on how I loaded this table from the Netflix data and removed the probe data. Here is the table:

Rating Table
  • movieid: int(2)
  • userid: int(3)
  • rating: int(1)
  • date: date
  • first: date
  • support: int(3)
  • avg: double
  • avg2: double
  • residual: double
In addition to this table, I created a table which I called Pearson. Here is the sql for creating this table:

CREATE TABLE pearson (movie1 int(2), movie2 int(2), num int(3), pearson float,
index index_pearson_movie1(movie1),
index index_pearson_pearson(pearson));

So, we end up with the following table:

Pearson Table
  • movie1: int(2)
  • movie2: int(2)
  • num: int(3)
  • pearson: float
Now, the tricky part is generating the pearson correlation coefficient for all 17,770 movies. That's a set of 17,770 x 17,770 = over 315 million records. On a very powerful machine that may not be a problem, but on a 1GB machine, if you do not code the algorithm carefully, the processing can easily take much longer than 2 hours.

But before jumping into the algorithm, let's take a look at the Pearson formula. One of the important optimizations comes down to simplifying the standard formula.

Pearson Formula

The formula for computing the coefficient for two items x and y is the following:

r(x, y) = Σ (x_u - avg(x)) (y_u - avg(y)) / sqrt( Σ (x_u - avg(x))^2 * Σ (y_u - avg(y))^2 )

Here the sums run over the users u who rated both x and y, and avg(x) and avg(y) are the average ratings of x and y over those users.
Following the cue from the O'Reilly book Programming Collective Intelligence, it gets much simpler if we decompose it into the following five sums, plus n, the number of users who rated both items:

  • sum1 = Σ x_u - the sum of the ratings of x
  • sum2 = Σ y_u - the sum of the ratings of y
  • sumsq1 = Σ x_u^2 - the sum of the squared ratings of x
  • sumsq2 = Σ y_u^2 - the sum of the squared ratings of y
  • sumpr = Σ x_u * y_u - the sum of the products of the ratings
Then, the formula becomes:

pearson = ( sumpr - sum1*sum2/n ) / sqrt( (sumsq1 - sum1^2/n) * (sumsq2 - sum2^2/n) )
The result will be between -1 and 1; the closer it is to 1, the stronger the match. A movie compared to itself scores exactly 1.
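
To make the decomposition concrete, here is a minimal, self-contained Java sketch of the simplified formula above. It is illustrative only and is not the code used later in this post; the class and method names are made up.

public class PearsonSketch {

    // x[u] and y[u] are the ratings (1-5) that the u-th common user gave each movie.
    static double pearson(int[] x, int[] y) {
        double sum1 = 0, sum2 = 0, sumsq1 = 0, sumsq2 = 0, sumpr = 0;
        int n = x.length;
        for (int u = 0; u < n; u++) {
            sum1 += x[u];           // sum of ratings for the first movie
            sum2 += y[u];           // sum of ratings for the second movie
            sumsq1 += x[u] * x[u];  // sum of squared ratings for the first movie
            sumsq2 += y[u] * y[u];  // sum of squared ratings for the second movie
            sumpr += x[u] * y[u];   // sum of the products
        }
        double top = sumpr - (sum1 * sum2 / n);
        double bottom = Math.sqrt((sumsq1 - sum1 * sum1 / n) * (sumsq2 - sum2 * sum2 / n));
        return bottom == 0 ? 0 : top / bottom;
    }

    public static void main(String[] args) {
        System.out.println(pearson(new int[]{5, 3, 1}, new int[]{5, 3, 1}));  // prints 1.0
        System.out.println(pearson(new int[]{5, 3, 1}, new int[]{1, 3, 5}));  // prints -1.0
    }
}

Two identical rating vectors give exactly 1 and two opposite vectors give -1, matching the range described above.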

Loading up the Ratings table into memory

In order to calculate the pearson coefficient quickly, we need to load up the Rating table into memory and we need to have it sorted by both movieid and userid.

Here is what Dan Tillberg writes in the Netflix Prize forum:
...the trick is to have trainer ratings sorted by both movie and by user. Then, to find movie-movie similarities for, say, movie m1, you look at all of the *users* that rated movie m1, and then you go through each of those users' *movies* that they've rated to fill in the entire m1'th row of the movie-movie similarity matrix (actually, you only need to bother with movies m2 <= m1, since the other half is symmetrical, as noted further below). I do this in three minutes on a quad-core processor. If you calculate it cell by cell instead of row by row it takes a lot longer.
I'll talk about the computation of the coefficient in the next section. On my 1GB machine, I needed to run java with the following options:


java -Xms950m -Xmx950m pearson > pearson.txt


So, the goal is to output the result to a file pearson.txt which I will later load up into MySQL.

To speed things up, I don't fully sort the ratings by userid and movieid. Instead, I group all the ratings by userid and then group all the ratings by movieid. This makes loading them into memory go significantly faster.

Here are the arrays that I used for the load up:


private static final int NUM_RECORDS=99072112;

// The ratings grouped by user: for each slot, the movie and the rating.
byte[] ratingByUser = new byte[NUM_RECORDS];
short[] movieByUser = new short[NUM_RECORDS];

// The same ratings grouped by movie: for each slot, the user and the rating.
byte[] ratingByMovie = new byte[NUM_RECORDS];
int[] userByMovie = new int[NUM_RECORDS];
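
These four arrays alone take 1 + 2 + 1 + 4 = 8 bytes per rating, or roughly 790 MB for the 99,072,112 ratings, which is why nearly all of the 950 MB heap specified above is needed; the index arrays, the userid hash map, and the 5x5 tallies have to fit in what's left.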



To group each rating, I use the following arrays:


private static final int NUM_USERS=480189;
private static final int NUM_MOVIES=17770;

// userIndex[u] is the offset where user u's block begins in the *ByUser arrays;
// userNextPlace[u] is the next free slot in that block while loading.
int[] userIndex = new int[NUM_USERS+1];
int[] userNextPlace = new int[NUM_USERS];

// movieIndex and movieNextPlace play the same roles for the *ByMovie arrays.
int[] movieIndex = new int[NUM_MOVIES+1];
int[] movieNextPlace = new int[NUM_MOVIES];
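
For example, if relative user 0 has a support of 3 and relative user 1 has a support of 5, then userIndex[0] = 0, userIndex[1] = 3 and userIndex[2] = 8: user u's ratings always occupy slots userIndex[u] through userIndex[u+1] - 1 of the *ByUser arrays, and userNextPlace[u] points at the next empty slot in that block while the ratings are being loaded. The movie arrays work the same way.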


Some Help from MySQL

To speed up the processing, I create two tables in MySQL that provide the number of users who rated each movie and the number of movies rated by each user. These are relatively small tables and they will help in the grouping process. I will use the term "support" in the same way as BellKor: "movie support" is the number of users who rated a given movie, and "user support" is the number of movies rated by a given user.

Here are the two tables:

Movie_Support
  • movieid: int(2)
  • support: int(3)
User_Support
  • userid: int(3)
  • support: int(3)
Here's the sql for creating and loading up these tables:


CREATE TABLE movie_support (movieid INT(2), support INT(3));

CREATE TABLE user_support (userid INT(3), support INT(3));

INSERT INTO movie_support (movieid,support) SELECT movieid, COUNT(rating) FROM rating GROUP BY movieid;

INSERT INTO user_support (userid,support) SELECT userid, COUNT(rating) FROM rating GROUP BY userid;


We don't need any indexes for either of these tables.

Handling UserId

UserId presents an opportunity for reducing storage space. The userids in the training data range from 6 to 2,649,429, but there are only 480,189 distinct users.

To address this point, I use a simple hash table to map each absolute userid (6 to 2,649,429) to a relative userid (0 to 480,188). Here's the code to do this:


private static int nextRelUserId = 0;
private static HashMap<Integer,Integer> userMap = new HashMap<Integer,Integer>();

// Maps an absolute Netflix userid to a compact relative id (0 .. 480,188),
// assigning a new relative id the first time a userid is seen.
private static int getRelUserId(int userId) {
    Integer relUserId = userMap.get(userId);
    if (relUserId == null) {
        relUserId = nextRelUserId++;
        userMap.put(userId, relUserId);
    }
    return relUserId;
}
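
For example, if userid 6 is the first userid the program sees, getRelUserId(6) returns 0 on that call and on every later call, so relative ids are handed out consecutively (0, 1, 2, ...) in order of first appearance.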



Before Grouping Ratings by Movie and User

To prepare to load all the ratings and group them by userid and movieid, I used the following java code:


userIndex[NUM_USERS] = NUM_RECORDS;
int i=0;
// Lay out each user's block of slots in the *ByUser arrays.
ResultSet rs = stmt.executeQuery("SELECT userid,support FROM user_support");
while (rs.next()) {
    int relUserId = getRelUserId(rs.getInt("userid"));
    userIndex[relUserId] = i;
    userNextPlace[relUserId] = i;
    i += rs.getInt("support");
}
rs.close();

movieIndex[NUM_MOVIES] = NUM_RECORDS;
i = 0;
// Lay out each movie's block of slots in the *ByMovie arrays.
rs = stmt.executeQuery("SELECT movieid,support FROM movie_support");
while (rs.next()) {
    int relMovieId = rs.getInt("movieid")-1;
    movieIndex[relMovieId] = i;
    movieNextPlace[relMovieId] = i;
    i += rs.getInt("support");
}
rs.close();


Once I have these indices set, I am ready to start loading. The advantage of these arrays now becomes apparent: we make a single, very fast pass through the entire Rating table.

Grouping the Ratings by User and Movie

Here's the java code for loading up the ratings:


rs = stmt.executeQuery("SELECT userid,movieid,rating FROM rating");
while (rs.next()) {
int userId = rs.getInt("userid");
int relUserId = getRelUserId(userid);
short movieId = rs.getShort("movieid");
int relMovieId = movieId-1;
byte rating = rs.getByte("rating");

movieByUser[userNextPlace[relUserId]] = movieId;
ratingByUser[userNextPlace[relUserId]] = rating;
userNextPlace[relUserId]++;

userByMovie[movieNextPlace[relMovieId]] = userId;
ratingByMovie[movieNextPlace[relMovieId]] = rating;
movieNextPlace[relMovieId]++;
}
rs.close();


Quickly Tallying the Ratings

Once we've loaded up ratings into memory, we are ready to process them. To do this efficiently, I use a technique that Newman! posted on the Netflix Prize forum. Here's Newman's description:

If you're using raw score values (1 to 5), you can speed things up significantly by replacing the PearsonIntermediate structure with:

unsigned int32 viewerCnt[5][5];

where viewerCnt[m][n] is the number of viewers who gave movie X a score of m+1 and movie Y a score of n+1. Furthermore, based on the total number of viewers of a movie, you can replace int32 with int16 or int8, speeding it up even further.

On my 1.5G laptop, it took 20 minutes to calculate 17770x17770 correlations and write out a big data file. Of course, you really need to calculate only 17770x17770/2, if you have enough memory to hold all correlation and related values (like average rating from common viewers) in memory, otherwise disk seeking will kill your running time and hard disk when you read the data file later. On a multi-core processor, you can easily parallelize the calculations. So I think on a fast multi-core system, you can calculate all movie correlations in less than 5 minutes.

So, here's my code:

int[][][] values = new int[NUM_MOVIES][5][5];

for (i=0; i < NUM_MOVIES-1; i++) {
    // Clear the tallies for every movie j > i.
    for (int j=i+1; j < NUM_MOVIES; j++) {
        for (int k=0; k < 5; k++) {
            for (int l=0; l < 5; l++) {
                values[j][k][l] = 0;
            }
        }
    }

    // For every user who rated movie i+1, walk through that user's ratings
    // and tally the rating pair against every movie j+1 with j > i.
    for (int j=movieIndex[i]; j < movieIndex[i+1]; j++) {
        int relUserId = getRelUserId(userByMovie[j]);
        for (int k=userIndex[relUserId]; k < userIndex[relUserId+1]; k++) {
            if (movieByUser[k]-1 > i) {
                values[movieByUser[k]-1][ratingByUser[k]-1][ratingByMovie[j]-1]++;
            }
        }
    }
    // (The Pearson computation in the next section goes here, inside this loop.)
}
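
After this tallying pass, values[j][k][l] holds the number of users who gave movie j+1 a rating of k+1 and movie i+1 a rating of l+1. For example, values[j][2][4] counts the users who rated movie j+1 a 3 and movie i+1 a 5.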

Shrinking the Pearson Coefficient in relation to data available

BellKor state that they shrink the pearson coefficient based on the count of data available. They use the following formula in the article:

shrunk pearson(j,k) = ( |U(j,k)| / ( |U(j,k)| + α ) ) * pearson(j,k)

where |U(j,k)| is the size of "the set of users who rated both items j and k".

In the original article, they state that they used "some small α." In the Netflix Prize forum, Yehuda Koren thinks that they "used alpha values around 5-10." In my code, I use α = 10.
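
To see the effect: with α = 10, a movie pair with only 3 common raters and a raw correlation of 0.9 is shrunk to 0.9 x 3/13 ≈ 0.21, while a pair with 500 common raters keeps 0.9 x 500/510 ≈ 0.88, so coefficients supported by little data are pulled strongly toward zero.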

Computing the Pearson

With this tally, I am now able to compute the Pearson coefficient for each movie j in comparison to movie i.

Here's the code:


for (i=0; i < NUM_MOVIES-1; i++) {
// Insert the code from above for tallying the totals...
//...
for (int j=i+1; j < NUM_MOVIES; j++) {
float sum1=0;
float sum2=0;
float sumsq1=0;
float sumsq2=0;
float sumpr=0;
float num=0;
for (int k=1; k <= 5; k++) {
for (int l=1; k <= 5; l++) {
int val=values[j][k-1][l-1];
sum1 += l*val;
sum2 += k*val;
sumsq1 = l*l*val;
sumsq2 = k*k*val;
sumpr = k*l*val;
num += val;
}
}
float top = sumpr - (sum1*sum2/num);
float bottom = (float)Math.sqrt((sumsq1 - (sum1*sum1)/num)*(sumsq2 - (sum2*sum2)/num));
if (bottom != 0) {
float pearson = (top/bottom)*(num/(num+10));
System.out.print((i+1)+","+(j+1)+","+num+","+pearson+"\n");
System.out.print((j+1)+","+(i+1)+","+num+","+pearson+"\n");
}
else {
System.out.print((i+1)+","+(j+1)+","+num+","+0+"\n");
}
}
}


Loading Pearson Coefficients into MySQL


Here's the sql that I used to load the results back into MySQL:


use netflix;

ALTER TABLE pearson DISABLE KEYS;

LOAD DATA LOCAL INFILE 'pearson.txt' INTO TABLE pearson
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(movie1,movie2,num,pearson);

ALTER TABLE pearson ENABLE KEYS;


That's pretty much it.

Using Your Pearson Table

For example, to find the 10 movies that are most like The Adventures of Robin Hood (movieid 7064), we do the following:

SELECT movie1,pearson FROM pearson WHERE movie2=7064 ORDER BY pearson DESC LIMIT 10;

Now, I ran my pearson against the entire training data so your results may be different.

Here are my results:



References