Thursday, March 6, 2008

BellKor Algorithm: Neighborhood Based Approach

The content that follows is taken from Robert M. Bell and Yehuda Koren's article "Improved Neighborhood-based Collaborative Filtering." This article is cited as the reference for the Neighborhood-based approach used in the BellKor algorithm.

The Neighborhood-based algorithm as described by the BellKor team has three major steps:

1) data normalization

In this step, an attempt is made to control for "global effects": effects where certain users tend to rate movies higher or lower than average, as well as interaction effects, such as the likelihood that The Fellowship of the Ring, The Two Towers, and The Return of the King will receive similar ratings. Movies may get higher ratings when they first come out and lower ratings when they've been out a very long time. Some movies have a large number of ratings; others have far fewer.

For the probe data, predicting movie ratings based solely on the mean results in an RMSE of 1.1296 (see my earlier blog for a review of RMSE if needed). Controlling for "global effects" gave this same data an RMSE of 0.9657 (according to the cited article).
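As a quick illustration of how an RMSE like those above is computed, here is a small Python sketch. The ratings are made up for illustration; they are not Netflix data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Predict every probe rating with the training mean, the simplest baseline.
train = [4, 5, 3, 4, 2]
probe = [5, 3, 4]
mean = sum(train) / len(train)          # 3.6
baseline_rmse = rmse([mean] * len(probe), probe)
```

Any smarter predictor is then judged by how far it pushes this number down.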

In the actual kNN algorithm, residual data is processed. Using the algorithm without removing "global effects" resulted in an RMSE of 0.9536 (according to the cited article). Using kNN after removing "global effects" and matrix factorization results gave an RMSE of 0.9071 (according to the cited article).

I will not cover the removal of "global effects" or matrix factorization in today's blog. For purposes of focusing solely on the kNN algorithm, I will assume that all data is raw (that is, it has not been normalized).

In a future blog, I will cover "global effects" and matrix factorization and try to reproduce the BellKor results cited.

2) neighbor selection

This is where a similarity function is selected and used to determine the k nearest neighbors. This can be approached in a user-oriented way (a function on similar users) or in an item-oriented way (a function on similar items).

I will go into detail about the formulas used in the article for the similarity function. In a future blog, I will get into code and compare the BellKor approach to other kNN approaches.

3) determination of the interpolation weights

Determination of interpolation weights is very important since they are tied very closely to the prediction. Bell and Koren compare the results they reached with results from a "correlation-based interpolation."

They found that for raw scores, correlation-based interpolation resulted in an RMSE of 0.9947. In contrast, their approach resulted in an RMSE of 0.9536. After removing "global effects" and matrix factorization results, the correlation-based approach resulted in an RMSE of 0.9142 in comparison to team BellKor's result of 0.9071.

Below I will detail the formulas for the BellKor approach of "jointly derived interpolation."


I would also like to make a clear disclaimer. I am not an expert in learning theory or in the ideas presented in the article. This summary is my best effort to understand the formulas in the article. I would encourage everyone to read the original articles. If you are an expert in learning theory, please review this article and post your comments. I am very glad to be corrected on any point that I do not accurately represent.

Neighborhood-based Approach

The Neighborhood-based approach attempts to predict how a user u will rate an item i by looking at previous data. The user-oriented approach looks for users similar to u that have already rated item i. The item-oriented approach looks for items similar to i that user u has already rated. In both cases, the strategy is to find a set of the most similar users or items and then use this set to make the prediction.

One of the first questions that needs to be answered is: how many elements should be in such a set? Because of the importance of this question, the neighborhood-based approach is commonly called the "k Nearest Neighbors" (kNN) approach. In the BellKor article, they considered values of k = 20, k = 35, and k = 50. For the raw data, they found the best result with k = 20 (RMSE = 0.9536). For the normalized data, they found the same result for all three (RMSE = 0.9071).


As far as notation goes, the article refers to users with the variables u, v and refers to items with the variables i, j, k.

The set of k Nearest Neighbors is represented as N(u;i) in the case of similar users (the user-oriented approach) or N(i;u) in the case of similar items (the item-oriented approach).

The rating by user u of an item i is represented as rui.

Selection of N(u;i) or N(i;u) is based on a similarity function: sij when item i is compared to item j, and suv when user u is compared to user v.

U(i,j) represents the set of users who have rated both item i and item j and |U(i,j)| represents the number of users rating both i and j.

I(u,v) represents the set of items that are rated by both user u and user v and |I(u,v)| represents the number of items that are rated by both user u and user v.
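These sets are straightforward to compute from a ratings table. A small Python sketch, using toy ratings of my own invention (not the Netflix data format):

```python
# Ratings stored as {(user, item): rating}; toy data for illustration.
ratings = {
    ("u1", "i1"): 4, ("u1", "i2"): 5,
    ("u2", "i1"): 3, ("u2", "i2"): 2, ("u2", "i3"): 4,
    ("u3", "i2"): 1, ("u3", "i3"): 5,
}

def U(i, j):
    """U(i,j): the set of users who rated both item i and item j."""
    raters = lambda item: {u for (u, it) in ratings if it == item}
    return raters(i) & raters(j)

def I(u, v):
    """I(u,v): the set of items rated by both user u and user v."""
    items = lambda user: {it for (us, it) in ratings if us == user}
    return items(u) & items(v)

common_raters = U("i1", "i2")   # {"u1", "u2"}
common_items = I("u2", "u3")    # {"i2", "i3"}
```

|U(i,j)| and |I(u,v)| are then just `len(U(i, j))` and `len(I(u, v))`.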

Similarity Function and Neighbor Selection

To build N(i;u), we need a similarity function to determine the similarity of another item j to the given item i. In other words, we need sij.

The formula for sij is:

sij = |U(i,j)| / ( Σu∈U(i,j) (rui − ruj)² + α )

where α is a small value that we can play around with.
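A small Python sketch of this computation, assuming (as I read the article) that the similarity is the count of common raters divided by the regularized sum of their squared rating differences:

```python
def item_similarity(ratings, i, j, alpha=1.0):
    """sij = |U(i,j)| / (sum of squared rating differences + alpha)."""
    common = [(ratings[(u, i)], ratings[(u, j)])
              for (u, item) in ratings if item == i and (u, j) in ratings]
    if not common:
        return 0.0          # no users rated both items
    squared_diff = sum((rui - ruj) ** 2 for rui, ruj in common)
    return len(common) / (squared_diff + alpha)

# Toy ratings: two users rated both i1 and i2.
ratings = {("u1", "i1"): 4, ("u1", "i2"): 5,
           ("u2", "i1"): 3, ("u2", "i2"): 2}
s = item_similarity(ratings, "i1", "i2", alpha=1.0)   # 2 / (2 + 1)
```

The more users agree on the two items, the larger sij gets; α keeps the value bounded when the squared differences are tiny.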

To build N(u;i), we need a similarity function to determine the similarity of another user v to the given user u. In other words, we need suv.

The formula for suv is:

suv = |I(u,v)| / ( Σi∈I(u,v) (rui − rvi)² + α )

where α is a small value that we can play around with.

So, we can now build the set N(u;i), which consists of the k most similar users that have rated item i (based on maximizing suv), and the set N(i;u), which consists of the k most similar items that have been rated by user u (based on maximizing sij).
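Neighbor selection itself is then just a top-k pick by similarity. A minimal Python sketch, with a made-up similarity function standing in for sij:

```python
import heapq

def k_nearest(similarity, i, candidates, k):
    """Pick the k candidates most similar to i (largest similarity first)."""
    return heapq.nlargest(k, (j for j in candidates if j != i),
                          key=lambda j: similarity(i, j))

# Toy similarity on integer "items": closer numbers count as more similar.
sim = lambda i, j: -abs(i - j)
neighbors = k_nearest(sim, 5, [1, 2, 4, 6, 9], k=2)
```

For N(i;u) the candidates are the items user u has rated; for N(u;i) they are the users who have rated item i.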

In their article, I am unclear on how α is determined. They say "for some small α".

Determining Interpolation Weights

Next, we need to determine interpolation weights for the prediction.

If using a user-oriented approach, then each weight is wuv.

If using an item-oriented approach, then each weight is wij.

The goal is ultimately to use one of the following formulas for a prediction:

r̂ui = Σj∈N(i;u) wij ruj (item-oriented)

r̂ui = Σv∈N(u;i) wuv rvi (user-oriented)
If the data is dense enough, then the interpolation weights can be determined as a least squares problem:

Aw = b

which gives us:

w = A⁻¹b

where A is an invertible K x K matrix defined as:

Ajk = Σv rvj rvk (summing over users v other than u)

and b ∈ RK where:

bj = Σv rvj rvi
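In code, one would typically solve the system Aw = b directly rather than form the inverse of A. A minimal numpy sketch, with toy values for A and b that are my own (not from the article):

```python
import numpy as np

# Toy 3 x 3 system standing in for the K x K matrix A and vector b above.
A = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.8, 0.4],
              [0.3, 0.4, 1.5]])
b = np.array([1.0, 0.8, 0.6])

# solve() is cheaper and numerically more stable than computing inv(A) @ b.
w = np.linalg.solve(A, b)

# The prediction is then the weighted sum of the k neighbors' ratings.
neighbor_ratings = np.array([4.0, 3.0, 5.0])
prediction = w @ neighbor_ratings
```

Here K = 3 plays the role of the neighborhood size k.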

Since it is unreasonable to assume that the data is dense enough, one improvement would be to average each sum over its total number of members. In this case:

Ajk = ( Σu∈U(j,k) ruj ruk ) / |U(j,k)|

bj = ( Σu∈U(i,j) ruj rui ) / |U(i,j)|
where U(j,k) is the set of users who rated both item j and item k and U(i,j) is the set of users who rated both item i and item j.

But this can be quite inaccurate for very small sets U(i,j). To remedy this, the article recommends shrinking toward a baseline value, which they call avg.

We will create two estimates of the above values:

Âjk = ( |U(j,k)| Ajk + β·avg ) / ( |U(j,k)| + β )

b̂j = ( |U(i,j)| bj + β·avg ) / ( |U(i,j)| + β )

In the above estimates, "the parameter β controls the extent of the shrinkage." For example, in the article, they discuss their reasons for using β = 500 with normalized data.

The value of avg depends on where the entry sits in the matrix A:

if Ajk is a diagonal entry (j = k), then avg = the average of the diagonal entries of A

and if Ajk is a nondiagonal entry (j ≠ k), then avg = the average of the nondiagonal entries of A
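Putting the shrinkage step together, here is a small numpy sketch as I understand it, with illustrative matrices and a β of my own choosing:

```python
import numpy as np

def shrink(A_avg, support, beta=500.0):
    """Shrink each averaged entry toward avg, weighted by its support.

    A_avg: K x K matrix of averaged entries Ajk.
    support: K x K matrix where support[j, k] = |U(j,k)|.
    Diagonal entries are pulled toward the diagonal average; nondiagonal
    entries are pulled toward the nondiagonal average.
    """
    K = A_avg.shape[0]
    diag = np.eye(K, dtype=bool)
    avg = np.where(diag, A_avg[diag].mean(), A_avg[~diag].mean())
    return (support * A_avg + beta * avg) / (support + beta)

# Toy 2 x 2 example (numbers are illustrative only).
A_avg = np.array([[2.0, 1.0], [1.0, 3.0]])
support = np.array([[10.0, 5.0], [5.0, 8.0]])
shrunk = shrink(A_avg, support, beta=500.0)
```

Entries backed by few common raters are pulled strongly toward avg, while well-supported entries keep most of their observed value.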

This then completes the formulas as presented in their article on the Neighborhood Based Approach. In my next blog, I will look into code for implementing the kNN approach and see if I am able to reproduce the RMSE scores reported by the BellKor team.


Sunday, March 2, 2008

BellKor Algorithm: Overview

In their article "The Bellkor solution to the Netflix Prize," the authors state that their algorithm is the blending of "107 individual results."

In the solution outlined, the article presupposes detailed understanding of each of the approaches as well as the linear regression analysis used to blend the results.

They list out the 107 results and categorize the results into the following categories:

1. Asymmetric factor models (9 results)
2. Regression models (15 results)
3. Restricted Boltzmann Machines with Gaussian visible units (8 results)
4. Restricted Boltzmann Machines (8 results)
5. Matrix factorization (30 results)
6. Neighborhood-based model (k-NN) (18 results)
7. Combinations (9 results)
8. Imputation of Qualifying predictions (7 results)
9. Specials (3 results)

Blending is accomplished by viewing the combination of results "as a linear regression problem".
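As a sketch of what least-squares blending looks like, here is a toy numpy example with three predictors instead of 107 and made-up ratings:

```python
import numpy as np

# Columns of X: each individual predictor's ratings on a held-out probe set.
# y: the true ratings. All numbers here are invented for illustration.
X = np.array([[3.8, 4.1, 3.9],
              [2.2, 2.0, 2.5],
              [4.6, 4.4, 4.8],
              [3.1, 3.3, 2.9]])
y = np.array([4.0, 2.0, 4.5, 3.0])

# Least-squares blending weights: minimize ||Xw - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# The blended predictor is the weighted combination of the individual ones.
blended = X @ w
```

The fitted weights are then applied to the same predictors' outputs on the qualifying set.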

In retrospect, the authors do not believe that all 107 results are necessary. They write:
For completeness, we listed all 107 results that were blended in our RMSE=0.8712 submission. It is important to note that the major reason for using all these results was convenience, as we anyway accumulated them during the year. This is the nature of an ongoing competition. However, in hindsight, we would probably drop many of these results and recompute some in a different way. We believe that far fewer results are really necessary.

They then list out some of the factors out of the 107 that have the most impact:
  • A blend of 3 results leads to an RMSE of 0.8793
  • A blend of 11 results leads to an 8% improvement over the Cinematch score.
Their recommended approach to the Netflix Challenge is the following:
Predictive accuracy is substantially improved when blending multiple predictors. Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique. (italics in original)

In my next blog, I will go into more detail on the Neighborhood-based model (k-NN).