Tuesday, May 27, 2008

BellKor Algorithm: Global Effects

Introduction

This article refers to the BellKor Algorithm which won the $50,000 Progress Prize as part of the Netflix Prize contest. The purpose of this article is to provide details on the winning algorithm.

The BellKor team found that removing Global Effects greatly improved the results of applying the K Nearest Neighbor (knn) Algorithm.

In today's blog, I will use MySQL to remove Global Effects in the effort to reproduce the results of the BellKor paper.

Disclaimer

I am not an expert in learning theory or statistics. The work done here is my best effort with some support from the Netflix Prize forum. While I was able to reproduce each of the effects that the BellKor identified, I did not get an exact match on some of the numbers. This indicates that the details provided here may not be exactly the same as done by BellKor.

I am very glad to be corrected by anyone who is able to get an exact match for the BellKor results or is an expert of the field.

MySQL

I chose MySQL because my goal here is to understand the logic while minimizing the amount of coding involved. While using memory mapped files with C/C++ is probably the fastest way to process data on Windows, I found that MySQL with indexed tables and occasional help from java/jdbc was fine for me.

For those interested in the steps I followed to load up the Netflix Prize Data into MySQL and remove the probe data, you can see the details here. The set up described there took me roughly 3 hours to complete.

If anyone has suggestions for reducing the time it takes or improving the calculations, please feel free to post it here.

After following this process, I have the following two tables:

Rating Table
  • movieid: int(2)
  • userid: int(3)
  • rating: int(1)
  • date: date
  • first: date
  • support: int(3)
  • avg: double
  • avg2: double
  • residual: double
The rating table has 99,072,112 ratings which do not include the probe table data.

Probe Table
  • movieid: int(2)
  • userid: int(3)
  • rating: int(1)
  • date: date
  • first: date
  • support: int(3)
  • avg: double
  • avg2: double
  • prediction: double
The probe table has 1,408,395 ratings.

Global Effects

Some users are easier on movies when they select their rating. If they like or hate a movie, they rate it a bit higher. Some uses are tougher in their ratings and they rate movies just a little bit lower. Both of these tendencies are examples of a "global effect." In this case, what the BellKor team call a "user effect". Some movies get higher ratings when they are first released than after they've been out for a while. This is an example of a "movie x time (movie) effect."

In their paper, the BellKor team present a strategy of removing a global effect "one 'effect' at a time." In this way, it is possible to test for "effects", identify them, and then remove them. After each removal, the RMSE goes down (see here for explanation about RMSE if needed).

The idea of "global effects" is very enticing. You identify an effect, you test for it, and then you remove it for the total. As long as you keep finding effects, in theory, the closer you get to the goal of an RMSE=0.85. In practice, the BellKor team found that global effects alone only got them to an RMSE=0.96. In this paper, I will go through each of the 11 global effects that the BellKor team identified in their paper.

Residuals

Since the goal is to remove each "effect" one at a time, the plan is not to focus on the ratings themselves but rather to focus on the "residuals" which are the ratings minus an effect.

To make this point as clear as possible, I will refer to rui as the rating of user u on item i. I will refer to each residual as rui(n) where n refers to the number of effects that have been removed.

For example:

rui(0) = rui.

rui(1) = rui(0) - effect1.

rui(2) = rui(1) - effect2

and so on until we have:

rui(n) = rui(n-1) - effectn.

The model for the effect is:

effectn = θ*xui

where:

xui = variable of interest

and

θ = effectn/xui

Variable of Interest

BellKor distinguishes between three types of effects:

(1) Main Effects

In this case, xui = 1

(2) User Effects

In this case, xui = xui - avg(xui for a given u)

(3) Item Effects

In this case, xui = xui - avg(xui for a given i)

They succinctly call this out by writing:
For other global effects, we center xui for each user by subtracting the mean of xui for that user.
Estimating θ

To estimate θ, the BellKor paper starts by looking at the idea of an unbiased estimator.

An estimator is a function on sample variables that "estimates" a population variable. An "unbiased estimator" is one where if the sample size is large enough, the estimator is equal to the population variable estimated.

BellKor proposes the following function for estimating the "effect" using an unbiased estimator:



Note: you can tell we are using an "estimator" because of the "hat" atop θ.

The problem with this approach is that in many cases, the sample size is not very large for a given user and there is always the chance that the sample is not large enough for a given item.

Using a Posterior Mean


To compensate for this, BellKor use the concept of a posterior mean. This can be derived by making assumptions about the normal distribution.

If we assume that for θ, the population mean is μ and the population variance is τ2, then θ can be characterized:
θ ∼ N(μ,τ2)

Further, using the sample variance σ2, the BellKor paper notes:



Using these two observations, they now restate the estimator for θ as its posterior mean:


Assumptions on μ and σ2

Now, since all parameters are centered (that is, x = x - avg(x)), we can assume that the population mean is 0. That is μ = 0.

BellKor further assume that σ2 is proportional ot 1/count(rui). If we assume that n = count(rui) then there exists a k such that σ2 = k/n.

Then, the revised formula is:



BellKor now states a very complicated formula for estimating τ2:



Ouch. Well, let's simplify it using the above assumptions. This gets us to:



Better. But then, BellKor do a "magic" step which is not clear to me at this point. They say that
it is possible to greatly simplify the above analysis into the following formula:

effect =



where α is a value calculated through cross-validation and n is the number of ratings.


Resulting Formula for Effect

So, after all the above analysis, the formula for effect is as follows:

effect =





The List of Global Effects

In their paper, BellKor list the following "effects":
  1. Overall Mean (RMSE = 1.1296)
  2. Movie Effect (RMSE = 1.0527)
  3. User Effect (RMSE = 0.9841)
  4. User x Time (user)1/2 (RMSE = 0.9809)
  5. User x Time (movie)1/2 (RMSE = 0.9786)
  6. Movie x Time (movie)1/2 (RMSE = 0.9767)
  7. Movie x Time (user)1/2 (RMSE = 0.9759)
  8. User x Movie average (RMSE = 0.9719)
  9. User x Movie support (RMSE = 0.9690)
  10. Movie x User average (RMSE = 0.9670)
  11. Movie x User support (RMSE = 0.9657)
First, I go into more detail about cross validation and the resulting α. Then, I will now proceed to show my effort to reproduce each effect using sql.

Cross-Validation Values

Yehuda Koren of BellKor has provided the alpha values that they used:
Movie: 25, 1.0527
User: 7, 0.9841
User-time(user): 550, 0.9809
User-time(movie): 150, 0.9786
Movie-time(movie): 4000, 0.9767
Movie-time(user): 500, 0.9759
User-mean(movie): 90, 0.9719
User-support(movie): 90, 0.9690
Movie-mean(user): 50, 0.9670
Movie-support(user): 50, 0.9658
In general, cross-validation is done simply by changing values. For example, if you use user_effect and set α = 6 and then α = 8, you will see the following pattern:

User effect: α = 6, RMSE = 0.984506
User effect: α = 7, RMSE =0.984436
User effect: α = 8, RMSE = 0.984434
User effect: α = 9, RMSE = 0.984485

Overall Mean


The formulas for the over all mean are:



Using sql, we have:


select sum(rating)/count(rating) from rating;


After less than a minute, the result is 3.6033.

To test out rmse for the overall mean, I run:


update probe set prediction = 3.6033;

create view rmse_probe as select sqrt(sum((rating - prediction)*(rating - prediction))/count(rating)) "rmse" from probe;

select * from rmse_probe;


From this select, I get RMSE = 1.1296. That's a match to the BellKor results.

This then gives us:

rui(1) = rui - [overall mean]

To update, the rating table, we do the following:


update rating set residual = rating - 3.6033;


This took almost 9 minutes on my machine.

Movie Effect

The formula here is:

xui = 1





movie effect = θi*xui = θi*1 = θi

The SQL for this is:


create table movie_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_effect (movieid, sum_r_ui_x_ui, sum_x_ui_squared, n_i) select movieid, sum(residual), count(residual), count(residual) from rating group by movieid;

update movie_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_effect set theta_i = (n_i*theta_i_hat)/(n_i + 25);

create index index_movie_effect_movieid on movie_effect(movieid);



On my machine, it completed in roughly 3 minutes.

To test it on probe, you do the following:


alter table movie_effect engine=memory;

update probe p set prediction = prediction + (select theta_i from movie_effect where movieid = p.movieid);



The RMSE that I get here is: 1.0527. That's a match to the BellKor results.

rui(2) = rui(1) - [movie effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i from movie_effect where movieid=r.movieid);

alter table movie_effect engine=myisam;


User Effect

The formula here is:

xui = 1





user effect = θu*xui = θu*1 = θu

Here's the SQL (note: on my system, the user_effect table is too big for memory, so I created a special table memory_user_effect):


create table user_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_effect (userid, sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid, sum(residual), count(residual),count(residual) from rating group by userid;

update user_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_effect set theta_u = (n_u*theta_u_hat)/(n_u + 7);

create index index_user_effect_userid on user_effect(userid);

set max_heap_table_size=32000000;

alter table user_effect engine=memory;

update probe p set prediction = prediction + (select theta_u from user_effect where userid=p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;



The RMSE I get here is: 0.98404. That's a match to the BellKor results.

rui(3) = rui(2) - [user effect]

To update, the rating table, we do the following:


update rating r set residual = residual - (select theta_u from user_effect where userid=r.userid);

alter table user_effect engine=myisam;


User x Time (user)1/2

The formula here is:

xui = √days - avg(√days)





user x time (user) effect = θu*[days - avg(√days)]

where days = sqrt(numOfDaysBetween(date,min(date) for user))

Here's what I did for the sql:

(1) Populate the "first" date field of tables: rating and probe


create table user_first_rating as select userid,min(date) "first" from rating group by userid;

create index index_user_first_rating_userid on user_first_rating(userid);

alter table user_first_rating engine=memory;

update rating r set first = (select first from user_first_rating where userid = r.userid);

update probe p set first = (select first from user_first_rating where userid = p.userid);

update probe p set first = date where datediff(date,first) < 0;


(2) Populate "avg" field of tables: rating and probe


create table user_avg_num_days_for_user (userid int(3), avg double);

insert into user_avg_num_days_for_user (userid,avg) select userid,avg(sqrt(datediff(date,first))) from rating group by userid;

create index index_user_avg_num_days_for_user_userid on user_avg_num_days_for_user(userid);

alter table user_avg_num_days_for_user engine=memory;

update rating r set avg = (select avg from user_avg_num_days_for_user where userid = r.userid);

update probe p set avg = (select avg from user_avg_num_days_for_user where userid = p.userid);

alter table user_avg_num_days_for_user engine=myisam;



(3) Compute user_time_user_effect


create table user_time_user_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_time_user_effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(residual) from rating group by userid;

update user_time_user_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_time_user_effect set theta_u = (n_u*theta_u_hat)/(n_u + 550);

update user_time_user_effect set theta_u = 0 where theta_u is null;

create index index_user_time_user_effect on user_time_user_effect(userid);

alter table user_time_user_effect engine=memory;

update probe p set prediction = prediction + (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_user_effect where userid=p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;



The RMSE I get here is: 0.9809. That's a match to the BellKor results.

rui(4) = rui(3) - [user x time(user) effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_user_effect where userid=r.userid);

alter table user_time_user_effect engine=myisam;


User x Time (movie)1/2

The formula here is:

xui = √days - avg(√days)





user x time (user) effect = θu*[days - avg(√days)]

where days = sqrt(numOfDaysBetween(date,min(date) for movie))

Here's the sql:

(1) Populate the "first" date field of tables: rating and probe


create table movie_first_rating as select movieid,min(date) "first" from rating group by movieid;

create index index_movie_first_rating_movieid on movie_first_rating(movieid);

alter table movie_first_rating engine=memory;

update rating r set first = (select first from movie_first_rating where movieid = r.movieid);

update probe p set first = (select first from movie_first_rating where movieid = p.movieid);

alter table movie_first_rating engine=myisam;

update probe p set first = date where datediff(date,first) < 0;


(2) Populate the "avg" field of rating and probe


create table user_avg_num_days_for_movie (userid int(3), avg double);

insert into user_avg_num_days_for_movie (userid,avg) select userid,avg(sqrt(datediff(date,first))) from rating group by userid;

create index index_user_avg_num_days_for_movie on user_avg_num_days_for_movie(userid);

alter table user_avg_num_days_for_movie engine=memory;

update rating r set avg = (select avg from user_avg_num_days_for_movie where userid = r.userid);

update probe p set avg = (select avg from user_avg_num_days_for_movie where userid=p.userid);

alter table user_avg_num_days_for_movie engine=myisam;



(3) Compute user_time_movie_effect


create table user_time_movie_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_time_movie_effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(residual) from rating group by userid;

update user_time_movie_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_time_movie_effect set theta_u = (n_u*theta_u_hat)/(n_u + 150);

update user_time_movie_effect set theta_u = 0 where theta_u is null;

create index index_user_time_movie_effect on user_time_movie_effect(userid);

alter table user_time_movie_effect engine=memory;

update probe p set prediction = prediction + (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_movie_effect where userid=p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;



The RMSE I get here is: 0.9785. That's a match to the BellKor results.

rui(5) = rui(4) - [user x time(movie) effect]


To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_movie_effect where userid=r.userid);

alter table user_time_movie_effect engine=myisam;


Movie x Time (movie)1/2

The formula here is:





where days = sqrt(numOfDaysBetween(date,min(date) for movie))

Here's the sql:

(1) Populate the "first" date field of tables: rating and probe


alter table movie_first_rating engine=memory;

update rating r set first = (select first from movie_first_rating where movieid = r.movieid);

update probe p set first = (select first from movie_first_rating where movieid = p.movieid);

update probe set first = date where datediff(date,first) < 0;

alter table movie_first_rating engine=myisam;


(2) Populate the "avg" field of rating and probe


create table movie_avg_num_days_for_movie (movieid int(2), avg double);

insert into movie_avg_num_days_for_movie (movieid,avg) select movieid,avg(sqrt(datediff(date,first))) from rating group by movieid;


create index index_movie_avg_num_days_for_movie on movie_avg_num_days_for_movie(movieid);


alter table movie_avg_num_days_for_movie engine=memory;

update rating r set avg = (select avg from movie_avg_num_days_for_movie where movieid = r.movieid);

update probe p set avg = (select avg from movie_avg_num_days_for_movie where movieid=p.movieid);

alter table movie_avg_num_days_for_movie engine=myisam;



(3) Compute movie_time_movie_effect


create table movie_time_movie_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_time_movie_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(residual) from rating group by movieid;

update movie_time_movie_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_time_movie_effect set theta_i = (n_i*theta_i_hat)/(n_i + 4000);

update movie_time_movie_effect set theta_i = 0 where theta_i is null;

create index index_movie_time_movie_effect on movie_time_movie_effect(movieid);

alter table movie_time_movie_effect engine=memory;

update probe p set prediction = prediction + (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_movie_effect where movieid=p.movieid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;



The RMSE I get here is: 0.9766. That's a match to the BellKor results.

rui(6) = rui(5) - [movie x time(movie) effect]


To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_movie_effect where movieid=r.movieid);

alter table movie_time_movie_effect engine=myisam;


Movie x Time (user)1/2

The formula here is:





wheredays = sqrt(numOfDaysBetween(date,min(date) for user))

Here's the sql:

(1) Populate the "first" date field of tables: rating and probe


alter table user_first_rating engine=memory;

update rating r set first = (select first from user_first_rating where userid = r.userid);

update probe p set first = (select first from user_first_rating where userid = p.userid);

update probe set first = date where datediff(date,first) < 0;


(2) Populate "avg" field of tables: rating and probe.


create table movie_avg_num_days_for_user (movieid int(2), avg double);

insert into movie_avg_num_days_for_user (movieid,avg) select movieid,avg(sqrt(datediff(date,first))) from rating group by movieid;

create index index_movie_avg_num_days_for_user_movieid on movie_avg_num_days_for_user(movieid);

alter table movie_avg_num_days_for_user engine=memory;

update rating r set avg = (select avg from movie_avg_num_days_for_user where movieid = r.movieid);

update probe p set avg = (select avg from movie_avg_num_days_for_user where movieid = p.movieid);

alter table movie_avg_num_days_for_user engine=myisam;


(3) Compute movie_time_user_effect


create table movie_time_user_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_time_user_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(rating) from rating group by movieid;

update movie_time_user_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_time_user_effect set theta_i = (n_i*theta_i_hat)/(n_i + 500);

create index index_movie_time_user_effect_movieid on movie_time_user_effect(movieid);

alter table movie_time_user_effect engine=memory;

update probe p set prediction = prediction + (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_user_effect where movieid = p.movieid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;



The RMSE I get here is: 0.9758. That's a match to the BellKor results.

rui(7) = rui(6) - [movie x time(user) effect]


To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_user_effect where movieid=r.movieid);

alter table movie_time_user_effect engine=myisam;


User x Movie average

The formula here is:





Here's the sql:

(1) Populate the "avg" field of tables: rating and probe


create table movie_average (movieid int(2), avg double);

insert into movie_average (movieid,avg) select movieid,avg(rating) from rating group by movieid;

create index index_movie_average_movieid on movie_average(movieid);

alter table movie_average engine=memory;

update rating r set avg = (select avg from movie_average where movieid = r.movieid);

update probe p set avg = (select avg from movie_average where movieid = p.movieid);

alter table movie_average engine=myisam;


(2) Compute user_movie_average_effect


create table user_movie_average_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_movie_average_effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(avg - 3.6033)),sum((avg - 3.6033)*(avg - 3.6033)),count(rating) from rating group by userid;

update user_movie_average_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_movie_average_effect set theta_u = (n_u*theta_u_hat)/(n_u + 550);

update user_movie_average_effect set theta_u = 0 where theta_u is null;

create index user_movie_average_effect_userid on user_movie_average_effect(userid);

alter table user_movie_average_effect engine=memory;

update probe p set prediction = prediction + (select theta_u*(avg - 3.6033) from user_movie_average_effect where userid=p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;



The RMSE I get here is: 0.9720. That's not an exact match but pretty close to the BellKor reported result of RMSE = 0.9719.

rui(8) = rui(7) - [user x movie average effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_u*(avg - 3.6033) from user_movie_average_effect where userid=r.userid);

alter table user_movie_average_effect engine=myisam;


User x Movie support

The formula here is:





where support = the number of ratings for a given movie.

Here's the sql:

(1) Populate the "support" field of tables: rating and probe


create table movie_support (movieid int(2), support int(3));

insert into movie_support(movieid,support) select movieid,count(rating) from rating group by movieid;

create index index_movie_support_movieid on movie_support(movieid);

alter table movie_support engine=memory;

update rating r set support = (select support from movie_support where movieid = r.movieid);

update probe p set support = (select support from movie_support where movieid = p.movieid);

alter table movie_support engine=myisam;


(2) Populate avg field of tables: rating and probe



create table average_movie_support_for_user (userid int(3), avg double);

insert into average_movie_support_for_user(userid,avg) select userid,avg(sqrt(support)) from rating group by userid;

create index index_average_movie_support_for_user_userid on average_movie_support_for_user(userid);

alter table average_movie_support_for_user engine=memory;

update rating r set avg = (select avg from average_movie_support_for_user where userid = r.userid);

update probe p set avg = (select avg from average_movie_support_for_user where userid = p.userid);

alter table average_movie_support_for_user engine=myisam;


(3) Compute user_movie_support_effect


create table user_movie_support_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_movie_support_ effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(sqrt(support) - avg),sum((sqrt(support) - avg)*(sqrt(support) - avg)),count(rating) from rating group by userid;

update user_movie_support_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_movie_support_effect set theta_u = (n_u*theta_u_hat)/(n_u + 90);

update user_movie_support_effect set theta_u = 0 where theta_u is null;

create index index_user_movie_support_effect_userid on user_movie_support_effect(userid);

alter table user_movie_support_effect engine=memory;

update probe p set prediction = prediction + (select theta_u*(sqrt(support) - avg) from user_movie_support_effect where userid = p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;



The RMSE I get here is: 0.9691. Still, extremely close to the BellKor reported result of RMSE = 0.9690.

rui(9) = rui(8) - [user x movie support effect]


To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_u*(sqrt(support) - avg) from user_movie_support_effect where userid=r.userid);

alter table user_movie_support_effect engine=myisam;


Movie x User average

The formula here is:





Here's the sql:

(1) Populate the "avg" field of tables: rating and probe


create table user_average (userid int(3), avg double);

insert into user_average (userid,avg) select userid,avg(rating) from rating group by userid;

create index index_user_average_userid on user_average(userid);

alter table user_average engine=memory;

update rating r set avg = (select avg from user_average where userid = r.userid);

update probe p set avg = (select avg from user_average where userid = p.userid);

alter table user_average engine=myisam;



(2) Populate "avg2" field of tables: rating and probe


alter table movie_average engine=memory;

update rating r set avg2 = (select avg from movie_average where movieid = r.movieid);

update probe p set avg2 = (select avg from movie_average where movieid = p.movieid);

alter table movie_average engine=myisam;


(3) Compute movie_user_average_effect



create table movie_user_average_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_user_average_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(avg - avg2),sum((avg - avg2)*(avg - avg2)),count(rating) from rating group by movieid;

update movie_user_average_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_user_average_effect set theta_i = (n_i*theta_i_hat)/(n_i + 50);

create index index_movie_user_average_effect_movieid on movie_user_average_effect(movieid);

alter table movie_user_average_effect engine=memory;

update probe p set prediction = prediction + (select theta_i*(avg - avg2) from memory_movie_user_average_effect where movieid = p.movieid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;



The RMSE I get here is: 0.9680. While I caught an effect, my result is different than BellKor's reported result of RMSE = 0.9670.

rui(10) = rui(9) - [movie x user average effect]



To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i*(avg - avg2) from movie_user_average_effect where movie=r.movieid);

alter table movie_user_average_effect engine=myisam;


Movie x User support

The formula here is:





where support = the number rating made by a given user.

Here is the sql:

(1) Populate the "support" field of tables: rating and probe


create table user_support (userid int(3), support int(3));

insert into user_support(userid,support) select userid,count(rating) from rating group by userid;

create index index_user_support_userid on user_support(userid);

alter table user_support engine=memory;

update rating r set support = (select support from user_support where userid=r.userid);

update probe p set support = (select support from user_support where userid=p.userid);

alter table user_support engine=myisam;


(2) Populate "avg" field of tables: rating and probe


create table average_user_support_for_movie (movieid int(2), avg double);

insert into average_user_support_for_movie(movieid,avg) select movieid,avg(support) from rating group by movieid;

create index index_average_user_support_for_movie_movieid on average_user_support_for_movie(movieid);

alter table average_user_support_for_movie engine=memory;

update rating r set avg = (select avg from average_user_support_for_movie where movieid = r.movieid);

update probe p set avg = (select avg from average_user_support_for_movie where movieid = p.movieid);

alter table average_user_support_for_movie engine=myisam;


(3) Compute movie_user_support_effect


create table movie_user_support_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_user_support_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(sqrt(support) - sqrt(avg))),sum((sqrt(support) - sqrt(avg))*(sqrt(support) - sqrt(avg))),count(rating) from rating group by movieid;

update movie_user_support_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_user_support_effect set theta_i = (n_i*theta_i_hat)/(n_i + 50);

update movie_user_support_effect set theta_i = 0 where theta_i is null;

create index movie_user_support_effect_movieid on movie_user_support_effect(movieid);

alter table movie_user_support_effect engine=memory;

update probe p set prediction = prediction + (select theta_i*(sqrt(support) - sqrt(avg) from movie_user_support_effect where movieid = p.movieid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;



The RMSE I get here is: 0.9658. That got me back on track to the BellKor reported result of RMSE = 0.9657.

rui(11) = rui(10) - [movie x user support effect]


To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i*(avg - avg2) from movie_user_average_effect where movie=r.movieid);

alter table movie_user_average_effect engine=myisam;


Now, the trick is to find any other global effects to see how low you can go.

Observations from Netflix Forum

There is one particular thread on the NetflixPrize forum where the experts have spoken about global effects.

In this discussion, Gavin Potter (Just a Guy in a Garage ranked #4 at the time of the article) notes that he is able to get to an RMSE of 0.95 just using global effects. This impressed Yehuda Koren of the BellKor team who writes:

Gavin, I am very impressed by the RMSE=0.9503 achieved by pure global effects.

Yehuda Koren states that he does not believe that Global Effects alone will get someone to the NetFlix Prize. He writes:

Global effects are neat and intuitive, unlike the latent factor models (SVD and the likes) that by definition seek hidden, less explainable factors.
However, I don’t expect that improving global effects will have much impact on the final RMSE. Latent factor models were proven much more effective at recovering the same information and much beyond. Of course, I might be wrong about this...
In my next article, I will focus on the k-nearest neighbor (knn) algorithm and try to reproduce the BellKor results for global effects + knn.

Just to see what would happen, I ran the above "global effects" from the qualifying.txt and submitted to Netflix. The result against the qualifying data was an RMSE = 0.9665 which is very close to the RMSE = 0.9658 that I received against the probe data.

References

31 comments:

Dan said...

Cool. Thank you. I had seen "global effects" referenced in a few places but didn't find a lot of elaboration. Your post was extremely helpful - especially your reproduction of their results.

Unknown said...
This comment has been removed by the author.
Unknown said...

I am confused about one thing: how do we utilize the residuals along with the original ratings in kNN or a similar algorithm? One way would be to use the residuals instead of the original ratings to obtain k nearest neighbors, but at some point, we would need to determine the numerical rating to submit back to netflix. If we use the residuals of the neighbors to come up with the rating, we will somehow need to adjust it back. An alternate would be to use the residuals to figure out neighbors, but then use their original ratings of the target movie to obtain rating for the target user.

Your help is appreciated!

Larry Freeman said...

In the BellKor paper, they work with residuals. This works best since after you remove global effects, the residuals are more normalized, that is, they are more similar to each other.

From this viewpoint, knn works better with residuals. So,you use knn to compute your predicted residual.

Then, to predict the rating, you just add back the global effects.

To make it as clear as possible:

residual used to determine neighbors = rating - movie_effect - user_effect - user_time_user_effect - user_time_movie_effect - ... etc

predicted residual comes from knn based upon the neighbor's residuals.

predicted rating = predicted residual + movie_effect + user_effect + user_time_user_effect + ... etc.

I hope that answers your question.

-Larry

Unknown said...

Larry,

Thanks for a quick response. It makes sense.

A small bug report regarding your other (i.e., setup and configuration) blog: the ratings_temp table is never created, but used. Also "status" field is used, but never created in the rating table. I am assuming that rating_temp is nothing but the rating table

Larry Freeman said...

Hi Hassan,

Thanks for noticing those typos. I fixed both.

Cheers,

-Larry

Unknown said...

I can think of one reason why your results are not exactly the same as BellKore's.

You are adjusting the probe predictions to stay betwen 1 and 5 after applying every effect. You may want to do it only once instead (i.e., after applying all effects). It is certainly possibly that one effect can take a prediction below one and the next one can adjust it back. However, if you have changed it in between to 1, the adjusted prediction will not be the same.

Unknown said...

Hi Larry,

I tried the code on this page. There are way too many typos in it for it to be useful. I fixed many of them but at the end, I managed to delete all probe predictions and almost gave up. It will be great if you could post a SQL file containing the actual statements executed. Appreciate your help!

Larry Freeman said...

Hi Hassan,

Very sorry to hear about your experience with the code.

Thanks for letting me know.

When I have time, I'll release my tested code.

I will post a comment when the tested code is ready.

-Larry

Unknown said...

Thanks a lot!

Gustavo Faigenbaum said...

Dear Larry,

Thanks so much for sharing this effort. I have been trying to reproduce your code in MS SQL and it goes fine until Movie x Time (movie)1/2. After that my RMSE goes down and I cannot reproduce the Bellkor results. I would really like to understand what's wrong with my code; it looks like I'm doing exactly the same as you yet I'm not getting the right results. I was wondering if you would be interested providing some (paid) consulting services and help me review my code and find the bugs. I don't have much money but I hope we can strike a deal that helps me correct my code and learn something new, and makes you earn a few bucks. Please let me know if you would be interested. Thanks,
GF

Larry Freeman said...

Hi Gustavo,

If you would like to post your SQL as a comment, I will take a look.

No worries about any consulting fees. This is a free blog. :-) Perhaps, others are having the same problems as you.

Cheers,

-Larry

Unknown said...

He is right. I experienced the same with your code.

Larry Freeman said...

Hi Hassan,

Thanks very much for confirming the issue. I really appreciate that you both have pointed this out! :-)

Today, I started going through the sQL code: statement by statement.

I've cleaned up a bunch of code already. Still, it will take a 2-3 more days for me to go through it completely.

I will post here when I have verified all the SQL and confirmed the RMSE numbers.

-Larry

Gustavo Faigenbaum said...

Thanks Larry. The problem I had faced on Movie x Time (movie)1/2 was my mistake. There was an error in my translation from your mysql to my ms sql. I have now corrected the error and managed to reproduce up to Global Effect 9. I am getting exactly the same RMSEs as you (rather than bellkor's). I have two more effects to go and I'll be done. Thanks for sharing... Gustavo.

Larry Freeman said...

Hi Gustavo,

I am glad to hear that you are having more success in reproducing. :-)

Please let me know if there are any points that I can better clarify.

Cheers,

-Larry

Gustavo Faigenbaum said...

Hello Larry,

I just finished reproducing all effects, up to effect 11. I got exactly the same results as you. I found only trivial typos in your code (such as missing parentheses or errors in table and field names as hassan mentioned) but no fundamental errors in the logic. Thanks again,

Gustavo

Larry Freeman said...

Gustavo,

Thanks very much for the feedback! :-)

I am reviewing the sql code now so hopefully all the typos will soon be fixed.

-Larry

Anonymous said...

http://www.netflixprize.com/community/viewtopic.php?pid=7726#p7726

Gerard Toonstra said...

I found one bug that may be the reason why the results are slightly off inbetween. You're using 550 in the following line:

update user_movie_average_effect set theta_u = (n_u*theta_u_hat)/(n_u + 550);

While in the cross validation values, 90 was determined. Although this leads to only 0.001 difference in that particular step, maybe the errors this generates have adverse effects on the following steps.

MikeM said...

How do people avoid division by zero in all those effects that calcuate datediff(date,first). Almost 10% of users did all their ratings on the same day. E.g. user 273808 rated 161 moves all on 15-Aug-2005.

Unknown said...

A remark: you said "In general, cross-validation is done simply by changing values". What people might understand is that one simply tries different values for \alpha and checks the RMSE on the probe, until the best value is found. This is incorrect! It is very important not to tweak our learning algorithm on the data we use to test it -- otherwise we unintentionally "mix" our test data into the training data!
Determining parameters by cross validation is done by separating the ratings (without the probe) into, say, 10 equal pieces, and then use 9 pieces for training and 1 for testing -- and average the RMSE on all 10 folds.

Thank you very much for what you do, it's extremely helpful!

ThinkerFeeler said...

You write,

xui = xui - avg(xui for a given u.

Is that a typo? Is it an assignment statement? It's certainly not math.
Perhaps you mean

xui = rui - avg(rui for a given u).

Thanks

Larry Freeman said...

Hi ThinkerFeeler,

I wrote that equation as part of an explanation of how errors are removed from estimates.

I perhaps show my background by using syntax in that way. :-)

While it may not be standard mathematics, it is standard in software.

In Java, for example, it is written as x = x - avg or even more commonly x -= avg.

Cheers,

-Larry

ThinkerFeeler said...

Thanks for the explanation.

It perplexes me that removing global effects helps. Here's why. Suppose neighboring users really like an item. Well, then doesn't that make it more likely that our user u will like the item too? Isn't it the entire point of KNN to make predictions based on neighbors, under the assumption that similar users will have similar tastes?

Similarly, if similar items are popular, then it's likely that our item i will be popular too. Removing the global effect would seem to defeat the entire purpose of nearest neighbors search.

Larry Freeman said...

Hi ThinkerFeeler,

You are right that it doesn't improve anything after you have picked your neighbors.

But the if you remove the global effects before you choose the neighbors, it makes a big difference because if you remove the global effects before picking neighbors.

Interestingly, it has less of an effect if you use it before SVD. In my own use, I got an improved result if I removed global effects after applying SVD.

I encourage you to read the BellKor paper which is linked on the bottom of the blog entry [I updated the link that was previously broken].

-Larry

Thanh said...

Hi,
From your analysis, effect=theta(u) x x(ui)
However, I saw in Koren's paper which said that the model is r(ui) = theta(u) x x(ui) + error.
So in my understand, residual would be "error"???
There is something that I misunderstand. Please explain it.
Thank you so much!

Larry Freeman said...

Hi Thanh,

That's my understanding. As I understand it, the residual represents the "error" from the average.

If there was no "error", then there would be no effect that needed to be identified.

-Larry

Thanh said...

Hi,
I have another question in the Movie x UserAverage effect. In may understand, the formula must be xui = avg(user) - 3.6033, but your formula is xui = avg(user) - avg(movie)???. Could U please explain it?

Thank you so much!
ThanhNH.

Larry Freeman said...

Hi Thanh,

This blog was a documentation of my effort to understand the paper by BellKor.

The formula I describe is the formula that worked for me. It is most likely not the optimal way to isolate an effect.

If your formula works, I would suggest testing to see which formula results in the better RMSE.

It's been a while since I worked on this blog entry so I can't really justify the formula that I posted versus the alternatives.

If you would like to see how I came up with these formulas, I suggest that you go to the Netflix Prize Forums which has much of the discussions archived.

Do a search for "Global Effects" to find the discussions.

Trudy said...

Thank you very much for your post. It is really helpful. I have been working on this kNN algo for some days. This post helps me verify my understanding on the normalization.