Tuesday, July 22, 2008

BellKor Algorithm: Pearson Correlation Coefficient

Introduction

In their paper on the k-nearest neighbor (knn) algorithm, the BellKor team mention three choices that they considered for a similarity function: Pearson's Correlation Coefficient, the Cosine Coefficient, and the Root Mean Square Difference.

In today's blog, I will focus solely on the Pearson Correlation Coefficient. It is clear from the paper that they did a lot of work with Pearson. There is also a good explanation of it in the book Collective Intelligence.

In line with the BellKor article, I will focus on the item-item correlation. Newman! and Tillberg have reported that they can generate this statistic in 20 minutes on a single processor, and others have reported generating it in under 5 minutes on a quad-core processor. The algorithm that I describe in today's blog took 2 hours to run and required 1GB of RAM on my machine. It took another 45 minutes to load the results into MySQL. I wrote the algorithm in Java; I suspect that a C/C++ version could go faster, and the code is not well optimized. My goal here is to present the algorithm in the clearest possible way.

MySQL Tables

It is assumed that a Rating table already exists in MySQL that includes all the Netflix training data. In a previous blog, I went into detail on how I loaded this table from the Netflix data and removed the probe data. Here is the table:

Rating Table
  • movieid: int(2)
  • userid: int(3)
  • rating: int(1)
  • date: date
  • first: date
  • support: int(3)
  • avg: double
  • avg2: double
  • residual: double
In addition to this table, I created a table that I call Pearson. Here is the SQL for creating this table:

CREATE TABLE pearson (movie1 int(2), movie2 int(2), num int(3), pearson float,
index index_pearson_movie1(movie1),
index index_pearson_pearson(pearson));

So, we end up with the following table:

Pearson Table
  • movie1: int(2)
  • movie2: int(2)
  • num: int(3)
  • pearson: float
Now, the tricky part is generating the pearson correlation coefficient for all 17,770 movies. That's a set of 17,770 x 17,770 = over 315 million records. On a very powerful machine, that may not be a problem, but on a 1GB machine, if you do not code the algorithm carefully, the processing can easily take far longer than 2 hours.

But before jumping into the algorithm, let's take a look at the Pearson formula. One of the important optimizations comes down to simplifying the standard formula.

Pearson Formula

Here is the formula for computing the coefficient for two items x and y:

r(x,y) = Σ (x_u − x̄)(y_u − ȳ) / sqrt( Σ (x_u − x̄)² · Σ (y_u − ȳ)² )

where the sums run over the n users u who rated both x and y, and x̄ and ȳ are the mean ratings of x and y over those users.

Following the cue from the O'Reilly book Collective Intelligence, the computation gets much simpler if we decompose it into the following five running sums over the n users who rated both x and y:

  • sum1 = Σ x_u (the sum of x's ratings)
  • sum2 = Σ y_u (the sum of y's ratings)
  • sumsq1 = Σ x_u² (the sum of the squares of x's ratings)
  • sumsq2 = Σ y_u² (the sum of the squares of y's ratings)
  • sumpr = Σ x_u·y_u (the sum of the products of the paired ratings)
Then, the formula becomes:

r(x,y) = ( sumpr − sum1·sum2/n ) / sqrt( (sumsq1 − sum1²/n) · (sumsq2 − sum2²/n) )

The result will be between -1 and 1; the closer it is to 1, the closer the match. A movie compared to itself will score a 1 (before the shrinkage described below).
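
To make the decomposition concrete, here is a small, self-contained Java sketch (separate from the main program; the class name and the ratings are just illustrative) that computes the coefficient for two movies from these five sums:

public class PearsonExample {
    // Computes the Pearson correlation for two movies using the five running
    // sums described above. x[i] and y[i] are the ratings (1..5) that the
    // i-th common user gave to each movie.
    static double pearson(int[] x, int[] y) {
        int n = x.length;
        double sum1 = 0, sum2 = 0, sumsq1 = 0, sumsq2 = 0, sumpr = 0;
        for (int i = 0; i < n; i++) {
            sum1 += x[i];
            sum2 += y[i];
            sumsq1 += x[i] * x[i];
            sumsq2 += y[i] * y[i];
            sumpr += x[i] * y[i];
        }
        double top = sumpr - (sum1 * sum2 / n);
        double bottom = Math.sqrt((sumsq1 - sum1 * sum1 / n) * (sumsq2 - sum2 * sum2 / n));
        return (bottom == 0) ? 0 : top / bottom;
    }

    public static void main(String[] args) {
        int[] x = {5, 4, 1, 2};   // ratings of movie x by four common users
        int[] y = {4, 5, 2, 1};   // ratings of movie y by the same four users
        System.out.println(pearson(x, y));   // prints 0.8
    }
}

With the four made-up common raters above, the five sums come out to 12, 12, 46, 46, and 44, giving top = 8, bottom = 10, and a correlation of 0.8.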

Loading up the Ratings table into memory

In order to calculate the Pearson coefficient quickly, we need to load the Rating table into memory, and we need to have it organized both by movieid and by userid.

Here is what Dan Tillberg writes in the Netflix Prize forum:
...the trick is to have trainer ratings sorted by both movie and by user. Then, to find movie-movie similarities for, say, movie m1, you look at all of the *users* that rated movie m1, and then you go through each of those users' *movies* that they've rated to fill in the entire m1'th row of the movie-movie similarity matrix (actually, you only need to bother with movies m2 <= m1, since the other half is symmetrical, as noted further below). I do this in three minutes on an quad-core processor. If you calculate it cell by cell instead of row by row it takes a lot longer.
I'll talk about the computation of the coefficient in the next section. On my 1GB machine, I needed to run Java with the following options:


java -Xms950m -Xmx950m pearson > pearson.txt
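
(The -Xms and -Xmx options set the initial and maximum Java heap size; both are set to 950 MB here so that the JVM claims almost all of the 1GB machine's memory up front.)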


So, the goal is to output the result to a file pearson.txt which I will later load up into MySQL.

To speed things up, I don't fully sort the ratings by userid and movieid. Instead, I group all the ratings by userid, and separately I group them all by movieid. This makes the load into memory go significantly faster.

Here are the arrays that I used for the load:


private static final int NUM_RECORDS=99072112;   // number of training ratings (probe removed)

// ratings grouped by user: movieByUser[i] and ratingByUser[i] belong to the same rating
byte[] ratingByUser = new byte[NUM_RECORDS];
short[] movieByUser = new short[NUM_RECORDS];

// the same ratings grouped by movie: userByMovie[i] and ratingByMovie[i] belong together
byte[] ratingByMovie = new byte[NUM_RECORDS];
int[] userByMovie = new int[NUM_RECORDS];
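
Together, these four arrays take 8 bytes per rating (1 + 2 + 1 + 4), or roughly 790 MB for the 99,072,112 ratings, which is why the 950 MB heap setting above is needed; the index arrays and the userid hash map described below add a few more tens of megabytes.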



To group each rating, I use the following arrays:


private static final int NUM_USERS=480189;
private static final int NUM_MOVIES=17770;

// userIndex[u] is where user u's block of ratings starts in the ...ByUser arrays;
// userNextPlace[u] is the next free slot in that block while loading
int[] userIndex = new int[NUM_USERS+1];
int[] userNextPlace = new int[NUM_USERS];
// movieIndex and movieNextPlace play the same role for the ...ByMovie arrays
int[] movieIndex = new int[NUM_MOVIES+1];
int[] movieNextPlace = new int[NUM_MOVIES];


Some Help from MySQL

To speed up the processing, I create two tables in MySQL that give me the number of movies rated by each user and the number of users who rated each movie. These are relatively small tables, and they help in the grouping process. I will use the term "support" in the same way as BellKor: "movie support" is the number of users who rated a movie, and "user support" is the number of movies rated by a user.

Here are the two tables:

Movie_Support
  • movieid: int(2)
  • support: int(3)
User_Support
  • userid: int(3)
  • support: int(3)
Here's the SQL for creating and loading these tables:


CREATE TABLE movie_support (movieid INT(2), support INT(3));

CREATE TABLE user_support (userid INT(3), support INT(3));

INSERT INTO movie_support (movieid,support) SELECT movieid, COUNT(rating) FROM rating GROUP BY movieid;

INSERT INTO user_support (userid,support) SELECT userid, COUNT(rating) FROM rating GROUP BY userid;


We don't need any indexes for either of these tables.

Handling UserId

UserId presents an opportunity for reducing storage space. The userids in the training data range from 6 to 2,649,429, but there are only 480,189 distinct users.

To address this, I use a simple hash table to map each absolute userid (6 .. 2,649,429) to a relative userid (0 .. 480,188). Here's the code to do this:


private static int nextRelUserId = 0;
private static HashMap<Integer,Integer> userMap = new HashMap<Integer,Integer>();

// returns the 0-based relative id for an absolute Netflix userid,
// assigning a new relative id the first time a userid is seen
private static int getRelUserId(int userId) {
    Integer relUserId = userMap.get(userId);
    if (relUserId == null) {
        relUserId = nextRelUserId++;
        userMap.put(userId, relUserId);
    }
    return relUserId;
}



Before Grouping Ratings by Movie and User

To prepare for loading all the ratings and grouping them by userid and movieid, I use the following Java code:


// set up per-user offsets into the ...ByUser arrays
userIndex[NUM_USERS] = NUM_RECORDS;
int i = 0;
ResultSet rs = stmt.executeQuery("SELECT userid,support FROM user_support");
while (rs.next()) {
    int relUserId = getRelUserId(rs.getInt("userid"));
    userIndex[relUserId] = i;
    userNextPlace[relUserId] = i;
    i += rs.getInt("support");
}
rs.close();

// set up per-movie offsets into the ...ByMovie arrays
movieIndex[NUM_MOVIES] = NUM_RECORDS;
i = 0;
rs = stmt.executeQuery("SELECT movieid,support FROM movie_support");
while (rs.next()) {
    int relMovieId = rs.getInt("movieid") - 1;
    movieIndex[relMovieId] = i;
    movieNextPlace[relMovieId] = i;
    i += rs.getInt("support");
}
rs.close();


Once I have these indices set, I am ready to start loading. The advantage of these arrays will now become apparent: we can fill both groupings in a single fast pass through the entire Rating table.
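
As a small worked example (with made-up numbers): if the user_support query returns three users with supports 3, 5, and 2, they get relative ids 0, 1, and 2, and the loop above sets userIndex to [0, 3, 8, ...]. User 0's ratings will occupy slots 0-2 of ratingByUser/movieByUser, user 1's will occupy slots 3-7, and user 2's will occupy slots 8-9; userNextPlace tracks the next free slot within each block as the ratings arrive in arbitrary order.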

Grouping the Ratings by User and Movie

Here's the Java code for loading the ratings:


rs = stmt.executeQuery("SELECT userid,movieid,rating FROM rating");
while (rs.next()) {
int userId = rs.getInt("userid");
int relUserId = getRelUserId(userid);
short movieId = rs.getShort("movieid");
int relMovieId = movieId-1;
byte rating = rs.getByte("rating");

movieByUser[userNextPlace[relUserId]] = movieId;
ratingByUser[userNextPlace[relUserId]] = rating;
userNextPlace[relUserId]++;

userByMovie[movieNextPlace[relMovieId]] = userId;
ratingByMovie[movieNextPlace[relMovieId]] = rating;
movieNextPlace[relMovieId]++;
}
rs.close();


Quickly Tallying the Ratings

Once we've loaded the ratings into memory, we are ready to process them. To do this efficiently, I use a technique that Newman! posted on the Netflix Prize forum. Here's Newman's description:

If you're using raw score values (1 to 5), you can speed things up significantly by replacing the PearsonIntermediate structure with:

unsigned int32 viewerCnt[5][5];

where viewerCnt[m][n] is the number of viewers who gave movie X a score of m+1 and movie Y a score of n+1. Furthermore, based on the total number of viewers of a movie, you can replace int32 with int16 or int8, speeding it up even further.

On my 1.5G laptop, it took 20 minutes to calculate 17770x17770 correlations and write out a big data file. Of course, you really need to calculate only 17770x17770/2, if you have enough memory to hold all correlation and related values (like average rating from common viewers) in memory, otherwise disk seeking will kill your running time and hard disk when you read the data file later. On a multi-core processor, you can easily parallelize the calculations. So I think on a fast multi-core system, you can calculate all movie correlations in less than 5 minutes.
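
In other words, because every rating is an integer from 1 to 5, all of the quantities needed by the Pearson formula can be recovered from Newman's 5×5 count matrix alone. If c[a][b] is the number of common viewers who gave one movie a rating of a and the other movie a rating of b, then num = Σ c[a][b], sum1 = Σ a·c[a][b], sum2 = Σ b·c[a][b], sumsq1 = Σ a²·c[a][b], sumsq2 = Σ b²·c[a][b], and sumpr = Σ a·b·c[a][b], with every sum running over a,b = 1..5. This is exactly what the computation loop in the "Computing the Pearson" section below does.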

So, here's my code:

int[][][] values = new int[NUM_MOVIES][5][5];

for (i = 0; i < NUM_MOVIES - 1; i++) {
    // clear the tallies for all movies j > i
    for (int j = i + 1; j < NUM_MOVIES; j++) {
        for (int k = 0; k < 5; k++) {
            for (int l = 0; l < 5; l++) {
                values[j][k][l] = 0;
            }
        }
    }

    // for every user who rated movie i, walk through all of that user's ratings
    // and tally the rating pairs for every movie j > i
    for (int j = movieIndex[i]; j < movieIndex[i + 1]; j++) {
        int relUserId = getRelUserId(userByMovie[j]);
        for (int k = userIndex[relUserId]; k < userIndex[relUserId + 1]; k++) {
            if (movieByUser[k] - 1 > i) {
                values[movieByUser[k] - 1][ratingByUser[k] - 1][ratingByMovie[j] - 1]++;
            }
        }
    }
}

Shrinking the Pearson Coefficient in relation to data available

BellKor state that they shrink the Pearson coefficient based on the amount of data available. They use the following formula in the article:

shrunk(j,k) = ( |U(j,k)| / ( |U(j,k)| + α ) ) · pearson(j,k)

where |U(j,k)| is the size of "the set of users who rated both items j and k".

In the original article, they state that they used "some small α." In the Netflix Prize Forum, Yehuda Koren thinks that they "used alpha values around 5-10." In my code, I use α = 10.
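
For example, with α = 10, a raw correlation of 0.60 computed from 50 common raters shrinks to 0.60 · 50/60 = 0.50, while the same raw correlation computed from 1,000 common raters barely moves (0.60 · 1000/1010 ≈ 0.594).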

Computing the Pearson

With this tally, I am now able to compute the Pearson coefficient between movie i and every movie j > i.

Here's the code:


for (i=0; i < NUM_MOVIES-1; i++) {
// Insert the code from above for tallying the totals...
//...
for (int j=i+1; j < NUM_MOVIES; j++) {
float sum1=0;
float sum2=0;
float sumsq1=0;
float sumsq2=0;
float sumpr=0;
float num=0;
for (int k=1; k <= 5; k++) {
for (int l=1; k <= 5; l++) {
int val=values[j][k-1][l-1];
sum1 += l*val;
sum2 += k*val;
sumsq1 = l*l*val;
sumsq2 = k*k*val;
sumpr = k*l*val;
num += val;
}
}
float top = sumpr - (sum1*sum2/num);
float bottom = (float)Math.sqrt((sumsq1 - (sum1*sum1)/num)*(sumsq2 - (sum2*sum2)/num));
if (bottom != 0) {
float pearson = (top/bottom)*(num/(num+10));
System.out.print((i+1)+","+(j+1)+","+num+","+pearson+"\n");
System.out.print((j+1)+","+(i+1)+","+num+","+pearson+"\n");
}
else {
System.out.print((i+1)+","+(j+1)+","+num+","+0+"\n");
}
}
}
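
Note that, for every pair with a nonzero denominator, both the (i+1, j+1) and the (j+1, i+1) rows are written, so the pearson table ends up holding both halves of the symmetric matrix and a movie's neighbors can be looked up through either the movie1 or the movie2 column (though only movie1 is indexed).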


Loading Pearson Coefficients into MySQL


Here's the SQL that I used to load the results back into MySQL:


use netflix;

ALTER TABLE pearson DISABLE KEYS;

LOAD DATA LOCAL INFILE 'pearson.txt' INTO TABLE pearson
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(movie1,movie2,num,pearson);

ALTER TABLE pearson ENABLE KEYS;


That's pretty much it.

Using Your Pearson Table

For example, to find the 10 movies that are most like The Adventures of Robin Hood (movieid 16840), we do the following:

SELECT movie1,pearson FROM pearson WHERE movie2=7064 ORDER BY pearson DESC LIMIT 10;

Note that I ran my Pearson calculation against the entire training data, so your results may differ.

Here are my results:




29 comments:

teCh poVerA said...

Impressive write-up once again.

One slightly off topic question. In their progress prize paper they mention:

"Another kind of similarity is based solely on who-rated-what (we refer to this as "support-based" or "binary-based" similarity). The full details are given in an Appendix 1."

"Interestingly, when post-processing residuals of factorization or RBMs, these seemingly inferior support-based similarities led to more accurate results."

I have found pearson not to be a good similarity measure when post processing residuals of other methods. I wonder if you do understand what they do on "Appendix 1". I'm also asking about it in here:

http://www.netflixprize.com/community/viewtopic.php?id=998

Larry Freeman said...

I'm glad you liked my write up.

As for your question, I have not thought in enough detail about Appendix 1 to give you a good answer.

It will be interesting to see how people respond on the Netflix Prize Forum.

When I have a good answer, I will post it here or respond to your question in the Netflix Prize Forum.

Cheers,

-Larry

by32768 said...

Hi, Newman! here. I didn't mention that replacing int32 with int16 or int8 gave me the most performance boost, no doubt because the absolute majority of viewers have less than 65536 ratings, and a large portion of them (more than 40% if I remember correctly) have less than 256 ratings, so the amount of memory you access is significantly less. In my C++ implementation, I just wrote a template function and have one copy each for int32, int16, and int8.

Also, the optimal data order for values[][][] is not [NUM_MOVIES][5][5], but [5][NUM_MOVIES][5]; this is a cache miss/memory access pattern thing, but it doesn't help as much as the int32/int16/int8 trick.

Larry Freeman said...

Hi Newman!

Thanks very much for the clarifications on your algorithm!

Cheers,

-Larry

Unknown said...
This comment has been removed by the author.
Unknown said...

I've been following your blog and loved the level of detail you covered on this topic. I tried to reproduce your results, but am getting OutOfMemoryErrors before my query can get all the data from the ratings table. I tried doubling your recommended heap size and that didn't do it. I even tried using stmt.setFetchSize(num_rows), which surprisingly didn't fix my problem either.

Any ideas?

Regards,
Taylor

Larry Freeman said...

How much memory does your machine have? Also, are you running any other applications at the same time? For example, I found that I needed to close down Firefox to get it working.

I was able to run my algorithm on a machine with 1GB memory.

-Larry

Unknown said...

Thanks for the quick response. I actually just figured out the problem was indeed that JDBC was trying to load the entire ratings query result into memory. Statement.setFetchSize(1) didn't solve my problem, but I learned that strangely Statement.setFetchSize(Integer.MIN_VALUE) did! Unless this is some strange behavior by my version of the JDBC drivers, I'm not sure why this works.

-Taylor

Larry Freeman said...

Yes, that's a well known problem with MySQL's version of JDBC. I'm glad that you figured it out.

Here's the bug report.

-Larry

Unknown said...

hi, thank you very much for your writeups!

FYI... if you run the knn on residuals from the global effects, adventures of robin hood suddenly has very reasonable continuous-like neighbors. that is--bonus mt'l, pirates of caribbean, indep day, lor 123, american beauty, and so on. I tried to email you, but can't find your email anywhere.

Anonymous said...

Hi Larry,

Many thanks for this code - I also used your sql setup and it worked great. I don't know much about java and I'm having trouble compiling your code. Is there any chance you could post it as a single file?

- Andy

Larry Freeman said...

Hi Andy,

Thanks for your feedback.

I'll try to put all the java code in a single place in the next week or so. I'll post the link here when I do.

-Larry

Anonymous said...

That would be great, Larry. Thank you again.

MikeM said...

It is not clear to me how the global effects are used in the pearson calculation, i.e. no use is made of 'residual'. I'm about to do #4 of your previous blog and came to a grinding halt when I realized that this is not used at all in the pearson calculation described in this blog. Am I missing something?

Anonymous said...

MikeM, I might be wrong, I haven't read Mike's other page. But the idea I got from reading the papers was that you'd run the pearson on the data set after you've applied the global effects - they are not predictive models in and of themselves. The global effects iron out anomalies and bumps in the data; the pearson ties them together.

MikeM said...

Just realized that Larry answered my question in his previous blog. I.e. you do rating minus all the effects, then predict using your knn algorithm and then add the effects back in to produce the qualifying result. This looks reasonable.

Anonymous said...

I managed to get the code up and compiled, after correcting several apparent mistakes with the iterators. But I'm getting some kind of exception at the 8,17770 mark, and the pearson results don't look at all right. Is there any chance of this code being updated to something usable? The whole tally part mystifies me and I just can't work it out :/

Larry Freeman said...

Sorry for the problems with the code. I have not figured out an easy way to get code into Blogger. It insists on reformatting it.

Here's the essence of the code:

(1) int[] usersOrderedByMovie

Assume this is an array of users ordered by movie.

(2) int[] movieIndex

Assume that this is an index that maps each movie to its starting position in usersOrderedByMovie.

(3) int[] numUsersPerMovie

This is the number of users who rated each movie.

In this way, we can iterate through all users for, say, movie 10 by using:

int base=movieIndex[10];
int n = numUsersPerMovie[10];
for (int i=base; i < base+n; i++) {
int userId = usersOrderedByMovie[i];
}

We can likewise do the same for users so that we have:

(4) int[] moviesOrderedByUser

(5) int[] userIndex

(6) int[] numMoviesPerUser

Now here's the algorithm for calculating the Pearson for all movies:


double[] similarity = new double[NMOVIES];

short[][][] values = new short[NMOVIES][5][5];

for (int movieid=1; movieid < 17771; movieid++) {

for each user (u) who rated this movie {

for each other movie (m != movieid) rated by this user {

let r1 = rating that u gave to m
let r2 = rating that u gave to movieid
values[m][r1-1][r2-1]++;

}

}


}

That's the essence of the algorithm. The tally is just a count to make it easier to order the usersOrderedByMovie and the moviesOrderedByUser.

-Larry

Anonymous said...

Hi Larry,

Thank you for your further explanation, I think I see what I'm supposed to do; but I'm still not sure why your code is stopping at 8,17770. I've posted a copy here with all the sql access stuff included.
http://paste.pocoo.org/show/115007/

The only changes I've made are to correct some typos, and these lines:

"for (int k=userIndex[relUserId]; k < userIndex[relUserId+1]; j++)" - where I changed j++ to k++. And:

for (int l=1; k <= 5; l++) where I changed k to l.

Are these corrections correct?
The exception comes at 8,17770 when trying to update the "values[movieByUser[k]-1][ratingByUser[k]-1][ratingByMovie[j]-1]++;"

Anonymous said...

err, anyone else trying to use the code I pasted should probably remove the 3 "System.out.println(+new Date());" lines if you're running from the command line. I've also limited the final sql rating query to 0,1000000 to make the loadup quicker for debugging, so remove that. This has no apparent effect on the exception.

Anonymous said...

Hi Larry,

Ok, I finally got this up and running for me, so I thought I'd post a link to the complete job:

http://pastebin.com/f315f7bcb

This includes all the server access garb, some time outputs to let you see the progress, and FileWriter outputs. I ran this from Run in Eclipse as, not knowing anything about Java, I couldn't get it to compile from the command line ;)

As I said before, I had to change a couple of the k & l iterators. In addition, in order to get sane results I had to change sumsq1, sumsq2 and sumpr to += and deleted the +10 from "float pearson = (top/bottom)*(num/(num+10))".

Many thanks for all your work - I'd still be nowhere without it.

Andy

Anonymous said...

Oh! I see - the +10 is the alpha value for the shrinking of the pearson. Sorry, missed that bit. I couldn't work out why comparing a movie against itself wasn't coming out with 1.0 with that in there - but I saw after that without this, you get lots of 1.0 values for movies with low support.

Sorry for spamming your blog with my noobishness, Larry.

by32768 said...

Larry, what's the probe and qualifying RMSE of your KNN algorithm ?

Larry Freeman said...

Hi Newman!,

I haven't completed the KNN algorithm yet, so I don't have an answer for you.

Recently, I've been working with SVD, SVD++, and the Brism.

I am planning to finish my work with KNN after I get SVD++ working.

-Larry

Larry Freeman said...

Hi Newman!,

My standard version of KNN improved my quiz RMSE by 5.

KNN combined with SVD++, etc. gave me a quiz score of .8848.

by32768 said...

Hi Larry,

There's a typo here: "My standard version of KNN improved my quiz RMSE by 5." ?

I just finished a blog post on my optimizations of KNN calculations. I'd appreciate it if you'd take a quick look and tell me if any part is not clear.

Larry Freeman said...

Hi Newman!,

Yes, a typo. I meant, I improved by .0005.

I've been thinking about RMSE with a multiple of 1000.

I haven't submitted my implementation of the BellKor knn yet. I am hoping to do that tonight.

I'll be very glad to review your blog entry.

Cheers,

-Larry

GETA said...

Hi Larry,


Thank you for this blog !

I have a question about the pearson correlation formula:

When I compute the correlation between 2 movies evaluated by 2 users.

Example:
user 1 gives the following ratings: 1, 1, 1 to movies M1, M2, M3.
user 2 gives the following ratings: 1, 1, 1 to movies M1, M2, M3.

the result of the pearson correlation formula is 0.

So how should I interpret this result?

Thanks for your help.

Amel.

Larry Freeman said...

Hi Amel,

Remember, a result is between -1 and 1. 0 means that the two movies are not as similar as they could be (1 is the closest match) and not as dissimilar as they could be (-1 is the farthest).

-Larry