Algorithm Analysis: May 2008

Introduction

This article refers to the BellKor Algorithm which won the $50,000 Progress Prize as part of the Netflix Prize contest. The purpose of this article is to provide details on the winning algorithm.

The BellKor team found that removing Global Effects greatly improved the results of applying the K Nearest Neighbor (knn) Algorithm.

In today's blog, I will use MySQL to remove Global Effects in the effort to reproduce the results of the BellKor paper.

Disclaimer

I am not an expert in learning theory or statistics. The work done here is my best effort with some support from the Netflix Prize forum. While I was able to reproduce each of the effects that the BellKor identified, I did not get an exact match on some of the numbers. This indicates that the details provided here may not be exactly the same as done by BellKor.

I am very glad to be corrected by anyone who is able to get an exact match for the BellKor results or is an expert of the field.

MySQL

I chose MySQL because my goal here is to understand the logic while minimizing the amount of coding involved. While using memory mapped files with C/C++ is probably the fastest way to process data on Windows, I found that MySQL with indexed tables and occasional help from java/jdbc was fine for me.

For those interested in the steps I followed to load up the Netflix Prize Data into MySQL and remove the probe data, you can see the details here. The set up described there took me roughly 3 hours to complete.

If anyone has suggestions for reducing the time it takes or improving the calculations, please feel free to post it here.

After following this process, I have the following two tables:

Rating Table

movieid: int(2)
userid: int(3)
rating: int(1)
date: date
first: date
support: int(3)
avg: double
avg2: double
residual: double

The rating table has 99,072,112 ratings which do not include the probe table data.

Probe Table

movieid: int(2)
userid: int(3)
rating: int(1)
date: date
first: date
support: int(3)
avg: double
avg2: double
prediction: double

The probe table has 1,408,395 ratings.

Global Effects

Some users are easier on movies when they select their rating. If they like or hate a movie, they rate it a bit higher. Some uses are tougher in their ratings and they rate movies just a little bit lower. Both of these tendencies are examples of a "global effect." In this case, what the BellKor team call a "user effect". Some movies get higher ratings when they are first released than after they've been out for a while. This is an example of a "movie x time (movie) effect."

In their paper, the BellKor team present a strategy of removing a global effect "one 'effect' at a time." In this way, it is possible to test for "effects", identify them, and then remove them. After each removal, the RMSE goes down (see here for explanation about RMSE if needed).

The idea of "global effects" is very enticing. You identify an effect, you test for it, and then you remove it for the total. As long as you keep finding effects, in theory, the closer you get to the goal of an RMSE=0.85. In practice, the BellKor team found that global effects alone only got them to an RMSE=0.96. In this paper, I will go through each of the 11 global effects that the BellKor team identified in their paper.

Residuals

Since the goal is to remove each "effect" one at a time, the plan is not to focus on the ratings themselves but rather to focus on the "residuals" which are the ratings minus an effect.

To make this point as clear as possible, I will refer to r_ui as the rating of user u on item i. I will refer to each residual as r_ui(n) where n refers to the number of effects that have been removed.

For example:

r_ui(0) = r_ui.

r_ui(1) = r_ui(0) - effect₁.

r_ui(2) = r_ui(1) - effect₂

and so on until we have:

r_ui(n) = r_ui(n-1) - effect_n.

The model for the effect is:

effect_n = θ*x_ui

where:

x_ui = variable of interest

and

θ = effect_n/x_ui

Variable of Interest

BellKor distinguishes between three types of effects:

(1) Main Effects

In this case, x_ui = 1

(2) User Effects

In this case, x_ui = x_ui - avg(x_ui for a given u)

(3) Item Effects

In this case, x_ui = x_ui - avg(x_ui for a given i)

They succinctly call this out by writing:

For other global effects, we center x_ui for each user by subtracting the mean of x_ui for that user.

Estimating θ

To estimate θ, the BellKor paper starts by looking at the idea of an unbiased estimator.

An estimator is a function on sample variables that "estimates" a population variable. An "unbiased estimator" is one where if the sample size is large enough, the estimator is equal to the population variable estimated.

BellKor proposes the following function for estimating the "effect" using an unbiased estimator:

Note: you can tell we are using an "estimator" because of the "hat" atop θ.

The problem with this approach is that in many cases, the sample size is not very large for a given user and there is always the chance that the sample is not large enough for a given item.

Using a Posterior Mean

To compensate for this, BellKor use the concept of a posterior mean. This can be derived by making assumptions about the normal distribution.

If we assume that for θ, the population mean is μ and the population variance is τ², then θ can be characterized:
θ ∼ N(μ,τ²)

Further, using the sample variance σ², the BellKor paper notes:

Using these two observations, they now restate the estimator for θ as its posterior mean:

Assumptions on μ and σ²

Now, since all parameters are centered (that is, x = x - avg(x)), we can assume that the population mean is 0. That is μ = 0.

BellKor further assume that σ² is proportional ot 1/count(r_ui). If we assume that n = count(r_ui) then there exists a k such that σ² = k/n.

Then, the revised formula is:

BellKor now states a very complicated formula for estimating τ²:

Ouch. Well, let's simplify it using the above assumptions. This gets us to:

Better. But then, BellKor do a "magic" step which is not clear to me at this point. They say that
it is possible to greatly simplify the above analysis into the following formula:

effect =

where α is a value calculated through cross-validation and n is the number of ratings.

Resulting Formula for Effect

So, after all the above analysis, the formula for effect is as follows:

effect =

The List of Global Effects

In their paper, BellKor list the following "effects":

Overall Mean (RMSE = 1.1296)
Movie Effect (RMSE = 1.0527)
User Effect (RMSE = 0.9841)
User x Time (user)^1/2 (RMSE = 0.9809)
User x Time (movie)^1/2 (RMSE = 0.9786)
Movie x Time (movie)^1/2 (RMSE = 0.9767)
Movie x Time (user)^1/2 (RMSE = 0.9759)
User x Movie average (RMSE = 0.9719)
User x Movie support (RMSE = 0.9690)
Movie x User average (RMSE = 0.9670)
Movie x User support (RMSE = 0.9657)

First, I go into more detail about cross validation and the resulting α. Then, I will now proceed to show my effort to reproduce each effect using sql.

Cross-Validation Values

Yehuda Koren of BellKor has provided the alpha values that they used:

Movie: 25, 1.0527
User: 7, 0.9841
User-time(user): 550, 0.9809
User-time(movie): 150, 0.9786
Movie-time(movie): 4000, 0.9767
Movie-time(user): 500, 0.9759
User-mean(movie): 90, 0.9719
User-support(movie): 90, 0.9690
Movie-mean(user): 50, 0.9670
Movie-support(user): 50, 0.9658

In general, cross-validation is done simply by changing values. For example, if you use user_effect and set α = 6 and then α = 8, you will see the following pattern:

User effect: α = 6, RMSE = 0.984506
User effect: α = 7, RMSE =0.984436
User effect: α = 8, RMSE = 0.984434
User effect: α = 9, RMSE = 0.984485

Overall Mean

The formulas for the over all mean are:

Using sql, we have:


select sum(rating)/count(rating) from rating;

After less than a minute, the result is 3.6033.

To test out rmse for the overall mean, I run:


update probe set prediction = 3.6033;

create view rmse_probe as select sqrt(sum((rating - prediction)*(rating - prediction))/count(rating)) "rmse" from probe;

select * from rmse_probe;

From this select, I get RMSE = 1.1296. That's a match to the BellKor results.

This then gives us:

r_ui(1) = r_ui - [overall mean]

To update, the rating table, we do the following:


update rating set residual = rating - 3.6033;

This took almost 9 minutes on my machine.

Movie Effect

The formula here is:

x_ui = 1

movie effect = θ_i*x_ui = θ_i*1 = θ_i

The SQL for this is:


create table movie_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_effect (movieid, sum_r_ui_x_ui, sum_x_ui_squared, n_i) select movieid, sum(residual), count(residual), count(residual) from rating group by movieid;

update movie_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_effect set theta_i = (n_i*theta_i_hat)/(n_i + 25);

create index index_movie_effect_movieid on movie_effect(movieid);

On my machine, it completed in roughly 3 minutes.

To test it on probe, you do the following:


alter table movie_effect engine=memory;

update probe p set prediction = prediction + (select theta_i from movie_effect where movieid = p.movieid);

The RMSE that I get here is: 1.0527. That's a match to the BellKor results.

r_ui(2) = r_ui(1) - [movie effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i from movie_effect where movieid=r.movieid);

alter table movie_effect engine=myisam;

User Effect

The formula here is:

x_ui = 1

user effect = θ_u*x_ui = θ_u*1 = θ_u

Here's the SQL (note: on my system, the user_effect table is too big for memory, so I created a special table memory_user_effect):


create table user_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_effect (userid, sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid, sum(residual), count(residual),count(residual) from rating group by userid;

update user_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_effect set theta_u = (n_u*theta_u_hat)/(n_u + 7);

create index index_user_effect_userid on user_effect(userid);

set max_heap_table_size=32000000;

alter table user_effect engine=memory;

update probe p set prediction = prediction + (select theta_u from user_effect where userid=p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;

The RMSE I get here is: 0.98404. That's a match to the BellKor results.

r_ui(3) = r_ui(2) - [user effect]

To update, the rating table, we do the following:


update rating r set residual = residual - (select theta_u from user_effect where userid=r.userid);

alter table user_effect engine=myisam;

User x Time (user)^1/2

The formula here is:

x_ui = √days - avg(√days)

user x time (user) effect = θ_u*[√days - avg(√days)]

where √days = sqrt(numOfDaysBetween(date,min(date) for user))

Here's what I did for the sql:

(1) Populate the "first" date field of tables: rating and probe


create table user_first_rating as select userid,min(date) "first" from rating group by userid;

create index index_user_first_rating_userid on user_first_rating(userid);

alter table user_first_rating engine=memory;

update rating r set first = (select first from user_first_rating where userid = r.userid);

update probe p set first = (select first from user_first_rating where userid = p.userid);

update probe p set first = date where datediff(date,first) < 0;

(2) Populate "avg" field of tables: rating and probe


create table user_avg_num_days_for_user (userid int(3), avg double);

insert into user_avg_num_days_for_user (userid,avg) select userid,avg(sqrt(datediff(date,first))) from rating group by userid;

create index index_user_avg_num_days_for_user_userid on user_avg_num_days_for_user(userid);

alter table user_avg_num_days_for_user engine=memory;

update rating r set avg = (select avg from user_avg_num_days_for_user where userid = r.userid);

update probe p set avg = (select avg from user_avg_num_days_for_user where userid = p.userid);

alter table user_avg_num_days_for_user engine=myisam;

(3) Compute user_time_user_effect


create table user_time_user_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_time_user_effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(residual) from rating group by userid;

update user_time_user_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_time_user_effect set theta_u = (n_u*theta_u_hat)/(n_u + 550);

update user_time_user_effect set theta_u = 0 where theta_u is null;

create index index_user_time_user_effect on user_time_user_effect(userid);

alter table user_time_user_effect engine=memory;

update probe p set prediction = prediction + (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_user_effect where userid=p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;

The RMSE I get here is: 0.9809. That's a match to the BellKor results.

r_ui(4) = r_ui(3) - [user x time(user) effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_user_effect where userid=r.userid);

alter table user_time_user_effect engine=myisam;

User x Time (movie)^1/2

The formula here is:

x_ui = √days - avg(√days)

user x time (user) effect = θ_u*[√days - avg(√days)]

where √days = sqrt(numOfDaysBetween(date,min(date) for movie))

Here's the sql:

(1) Populate the "first" date field of tables: rating and probe


create table movie_first_rating as select movieid,min(date) "first" from rating group by movieid;

create index index_movie_first_rating_movieid on movie_first_rating(movieid);

alter table movie_first_rating engine=memory;

update rating r set first = (select first from movie_first_rating where movieid = r.movieid);

update probe p set first = (select first from movie_first_rating where movieid = p.movieid);

alter table movie_first_rating engine=myisam;

update probe p set first = date where datediff(date,first) < 0;

(2) Populate the "avg" field of rating and probe


create table user_avg_num_days_for_movie (userid int(3), avg double);

insert into user_avg_num_days_for_movie (userid,avg) select userid,avg(sqrt(datediff(date,first))) from rating group by userid;

create index index_user_avg_num_days_for_movie on user_avg_num_days_for_movie(userid);

alter table user_avg_num_days_for_movie engine=memory;

update rating r set avg = (select avg from user_avg_num_days_for_movie where userid = r.userid);

update probe p set avg = (select avg from user_avg_num_days_for_movie where userid=p.userid);

alter table user_avg_num_days_for_movie engine=myisam;

(3) Compute user_time_movie_effect


create table user_time_movie_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_time_movie_effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(residual) from rating group by userid;

update user_time_movie_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_time_movie_effect set theta_u = (n_u*theta_u_hat)/(n_u + 150);

update user_time_movie_effect set theta_u = 0 where theta_u is null;

create index index_user_time_movie_effect on user_time_movie_effect(userid);

alter table user_time_movie_effect engine=memory;

update probe p set prediction = prediction + (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_movie_effect where userid=p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;

The RMSE I get here is: 0.9785. That's a match to the BellKor results.

r_ui(5) = r_ui(4) - [user x time(movie) effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_movie_effect where userid=r.userid);

alter table user_time_movie_effect engine=myisam;

Movie x Time (movie)^1/2

The formula here is:

where √days = sqrt(numOfDaysBetween(date,min(date) for movie))

Here's the sql:

(1) Populate the "first" date field of tables: rating and probe


alter table movie_first_rating engine=memory;

update rating r set first = (select first from movie_first_rating where movieid = r.movieid);

update probe p set first = (select first from movie_first_rating where movieid = p.movieid);

update probe set first = date where datediff(date,first) < 0;

alter table movie_first_rating engine=myisam;

(2) Populate the "avg" field of rating and probe


create table movie_avg_num_days_for_movie (movieid int(2), avg double);

insert into movie_avg_num_days_for_movie (movieid,avg) select movieid,avg(sqrt(datediff(date,first))) from rating group by movieid;


create index index_movie_avg_num_days_for_movie on movie_avg_num_days_for_movie(movieid);


alter table movie_avg_num_days_for_movie engine=memory;

update rating r set avg = (select avg from movie_avg_num_days_for_movie where movieid = r.movieid);

update probe p set avg = (select avg from movie_avg_num_days_for_movie where movieid=p.movieid);

alter table movie_avg_num_days_for_movie engine=myisam;

(3) Compute movie_time_movie_effect


create table movie_time_movie_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_time_movie_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(residual) from rating group by movieid;

update movie_time_movie_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_time_movie_effect set theta_i = (n_i*theta_i_hat)/(n_i + 4000);

update movie_time_movie_effect set theta_i = 0 where theta_i is null;

create index index_movie_time_movie_effect on movie_time_movie_effect(movieid);

alter table movie_time_movie_effect engine=memory;

update probe p set prediction = prediction + (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_movie_effect where movieid=p.movieid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;

The RMSE I get here is: 0.9766. That's a match to the BellKor results.

r_ui(6) = r_ui(5) - [movie x time(movie) effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_movie_effect where movieid=r.movieid);

alter table movie_time_movie_effect engine=myisam;

Movie x Time (user)^1/2

The formula here is:

where √days = sqrt(numOfDaysBetween(date,min(date) for user))

Here's the sql:

(1) Populate the "first" date field of tables: rating and probe


alter table user_first_rating engine=memory;

update rating r set first = (select first from user_first_rating where userid = r.userid);

update probe p set first = (select first from user_first_rating where userid = p.userid);

update probe set first = date where datediff(date,first) < 0;

(2) Populate "avg" field of tables: rating and probe.


create table movie_avg_num_days_for_user (movieid int(2), avg double);

insert into movie_avg_num_days_for_user (movieid,avg) select movieid,avg(sqrt(datediff(date,first))) from rating group by movieid;

create index index_movie_avg_num_days_for_user_movieid on movie_avg_num_days_for_user(movieid);

alter table movie_avg_num_days_for_user engine=memory;

update rating r set avg = (select avg from movie_avg_num_days_for_user where movieid = r.movieid);

update probe p set avg = (select avg from movie_avg_num_days_for_user where movieid = p.movieid);

alter table movie_avg_num_days_for_user engine=myisam;

(3) Compute movie_time_user_effect


create table movie_time_user_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_time_user_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(rating) from rating group by movieid;

update movie_time_user_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_time_user_effect set theta_i = (n_i*theta_i_hat)/(n_i + 500);

create index index_movie_time_user_effect_movieid on movie_time_user_effect(movieid);

alter table movie_time_user_effect engine=memory;

update probe p set prediction = prediction + (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_user_effect where movieid = p.movieid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;

The RMSE I get here is: 0.9758. That's a match to the BellKor results.

r_ui(7) = r_ui(6) - [movie x time(user) effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_user_effect where movieid=r.movieid);

alter table movie_time_user_effect engine=myisam;

User x Movie average

The formula here is:

Here's the sql:

(1) Populate the "avg" field of tables: rating and probe


create table movie_average (movieid int(2), avg double);

insert into movie_average (movieid,avg) select movieid,avg(rating) from rating group by movieid;

create index index_movie_average_movieid on movie_average(movieid);

alter table movie_average engine=memory;

update rating r set avg = (select avg from movie_average where movieid = r.movieid);

update probe p set avg = (select avg from movie_average where movieid = p.movieid);

alter table movie_average engine=myisam;

(2) Compute user_movie_average_effect


create table user_movie_average_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_movie_average_effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(avg - 3.6033)),sum((avg - 3.6033)*(avg - 3.6033)),count(rating) from rating group by userid;

update user_movie_average_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_movie_average_effect set theta_u = (n_u*theta_u_hat)/(n_u + 550);

update user_movie_average_effect set theta_u = 0 where theta_u is null;

create index user_movie_average_effect_userid on user_movie_average_effect(userid);

alter table user_movie_average_effect engine=memory;

update probe p set prediction = prediction + (select theta_u*(avg - 3.6033) from user_movie_average_effect where userid=p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;

The RMSE I get here is: 0.9720. That's not an exact match but pretty close to the BellKor reported result of RMSE = 0.9719.

r_ui(8) = r_ui(7) - [user x movie average effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_u*(avg - 3.6033) from user_movie_average_effect where userid=r.userid);

alter table user_movie_average_effect engine=myisam;

User x Movie support

The formula here is:

where support = the number of ratings for a given movie.

Here's the sql:

(1) Populate the "support" field of tables: rating and probe


create table movie_support (movieid int(2), support int(3));

insert into movie_support(movieid,support) select movieid,count(rating) from rating group by movieid;

create index index_movie_support_movieid on movie_support(movieid);

alter table movie_support engine=memory;

update rating r set support = (select support from movie_support where movieid = r.movieid);

update probe p set support = (select support from movie_support where movieid = p.movieid);

alter table movie_support engine=myisam;

(2) Populate avg field of tables: rating and probe



create table average_movie_support_for_user (userid int(3), avg double);

insert into average_movie_support_for_user(userid,avg) select userid,avg(sqrt(support)) from rating group by userid;

create index index_average_movie_support_for_user_userid on average_movie_support_for_user(userid);

alter table average_movie_support_for_user engine=memory;

update rating r set avg = (select avg from average_movie_support_for_user where userid = r.userid);

update probe p set avg = (select avg from average_movie_support_for_user where userid = p.userid);

alter table average_movie_support_for_user engine=myisam;

(3) Compute user_movie_support_effect


create table user_movie_support_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);

insert into user_movie_support_ effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(sqrt(support) - avg),sum((sqrt(support) - avg)*(sqrt(support) - avg)),count(rating) from rating group by userid;

update user_movie_support_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update user_movie_support_effect set theta_u = (n_u*theta_u_hat)/(n_u + 90);

update user_movie_support_effect set theta_u = 0 where theta_u is null;

create index index_user_movie_support_effect_userid on user_movie_support_effect(userid);

alter table user_movie_support_effect engine=memory;

update probe p set prediction = prediction + (select theta_u*(sqrt(support) - avg) from user_movie_support_effect where userid = p.userid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;

The RMSE I get here is: 0.9691. Still, extremely close to the BellKor reported result of RMSE = 0.9690.

r_ui(9) = r_ui(8) - [user x movie support effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_u*(sqrt(support) - avg) from user_movie_support_effect where userid=r.userid);

alter table user_movie_support_effect engine=myisam;

Movie x User average

The formula here is:

Here's the sql:

(1) Populate the "avg" field of tables: rating and probe


create table user_average (userid int(3), avg double);

insert into user_average (userid,avg) select userid,avg(rating) from rating group by userid;

create index index_user_average_userid on user_average(userid);

alter table user_average engine=memory;

update rating r set avg = (select avg from user_average where userid = r.userid);

update probe p set avg = (select avg from user_average where userid = p.userid);

alter table user_average engine=myisam;

(2) Populate "avg2" field of tables: rating and probe


alter table movie_average engine=memory;

update rating r set avg2 = (select avg from movie_average where movieid = r.movieid);

update probe p set avg2 = (select avg from movie_average where movieid = p.movieid);

alter table movie_average engine=myisam;

(3) Compute movie_user_average_effect



create table movie_user_average_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_user_average_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(avg - avg2),sum((avg - avg2)*(avg - avg2)),count(rating) from rating group by movieid;

update movie_user_average_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_user_average_effect set theta_i = (n_i*theta_i_hat)/(n_i + 50);

create index index_movie_user_average_effect_movieid on movie_user_average_effect(movieid);

alter table movie_user_average_effect engine=memory;

update probe p set prediction = prediction + (select theta_i*(avg - avg2) from memory_movie_user_average_effect where movieid = p.movieid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;

The RMSE I get here is: 0.9680. While I caught an effect, my result is different than BellKor's reported result of RMSE = 0.9670.

r_ui(10) = r_ui(9) - [movie x user average effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i*(avg - avg2) from movie_user_average_effect where movie=r.movieid);

alter table movie_user_average_effect engine=myisam;

Movie x User support

The formula here is:

where support = the number rating made by a given user.

Here is the sql:

(1) Populate the "support" field of tables: rating and probe


create table user_support (userid int(3), support int(3));

insert into user_support(userid,support) select userid,count(rating) from rating group by userid;

create index index_user_support_userid on user_support(userid);

alter table user_support engine=memory;

update rating r set support = (select support from user_support where userid=r.userid);

update probe p set support = (select support from user_support where userid=p.userid);

alter table user_support engine=myisam;

(2) Populate "avg" field of tables: rating and probe


create table average_user_support_for_movie (movieid int(2), avg double);

insert into average_user_support_for_movie(movieid,avg) select movieid,avg(support) from rating group by movieid;

create index index_average_user_support_for_movie_movieid on average_user_support_for_movie(movieid);

alter table average_user_support_for_movie engine=memory;

update rating r set avg = (select avg from average_user_support_for_movie where movieid = r.movieid);

update probe p set avg = (select avg from average_user_support_for_movie where movieid = p.movieid);

alter table average_user_support_for_movie engine=myisam;

(3) Compute movie_user_support_effect


create table movie_user_support_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);

insert into movie_user_support_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(sqrt(support) - sqrt(avg))),sum((sqrt(support) - sqrt(avg))*(sqrt(support) - sqrt(avg))),count(rating) from rating group by movieid;

update movie_user_support_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;

update movie_user_support_effect set theta_i = (n_i*theta_i_hat)/(n_i + 50);

update movie_user_support_effect set theta_i = 0 where theta_i is null;

create index movie_user_support_effect_movieid on movie_user_support_effect(movieid);

alter table movie_user_support_effect engine=memory;

update probe p set prediction = prediction + (select theta_i*(sqrt(support) - sqrt(avg) from movie_user_support_effect where movieid = p.movieid);

update probe set prediction = 5 where prediction > 5;

update probe set prediction = 1 where prediction < 1;

The RMSE I get here is: 0.9658. That got me back on track to the BellKor reported result of RMSE = 0.9657.

r_ui(11) = r_ui(10) - [movie x user support effect]

To update, the rating table, we do the following:



update rating r set residual = residual - (select theta_i*(avg - avg2) from movie_user_average_effect where movie=r.movieid);

alter table movie_user_average_effect engine=myisam;

Now, the trick is to find any other global effects to see how low you can go.

Observations from Netflix Forum

There is one particular thread on the NetflixPrize forum where the experts have spoken about global effects.

In this discussion, Gavin Potter (Just a Guy in a Garage ranked #4 at the time of the article) notes that he is able to get to an RMSE of 0.95 just using global effects. This impressed Yehuda Koren of the BellKor team who writes:

Gavin, I am very impressed by the RMSE=0.9503 achieved by pure global effects.

Yehuda Koren states that he does not believe that Global Effects alone will get someone to the NetFlix Prize. He writes:

Global effects are neat and intuitive, unlike the latent factor models (SVD and the likes) that by definition seek hidden, less explainable factors.
However, I don’t expect that improving global effects will have much impact on the final RMSE. Latent factor models were proven much more effective at recovering the same information and much beyond. Of course, I might be wrong about this...

In my next article, I will focus on the k-nearest neighbor (knn) algorithm and try to reproduce the BellKor results for global effects + knn.

Just to see what would happen, I ran the above "global effects" from the qualifying.txt and submitted to Netflix. The result against the qualifying data was an RMSE = 0.9665 which is very close to the RMSE = 0.9658 that I received against the probe data.

References

"Weights for Global Effects", Netflix Prize Forum
"Loose Ends", Netflix Prize Forum
"Improved Neighborhood-based Collaborative Filtering",Robert M. Bell and Yehuda Koren, AT&T Labs – Research

Algorithm Analysis

Tuesday, May 27, 2008

BellKor Algorithm: Global Effects

Blog Archive

About Me