The Neighborhood-based algorithm as described by the BellKor team has three major steps:
1) data normalization
In this step, an attempt is made to control for "global effects": effects where certain users tend to rate movies higher or lower than average, as well as interaction effects, such as the fact that The Fellowship of the Ring, The Two Towers, and The Return of the King will probably receive similar ratings. Movies may get higher ratings when they first come out or lower ratings when they have been out a very long time. Some movies have a large number of ratings; others have far fewer.
For the probe data, predicting movie ratings based solely on the mean results in an RMSE of 1.1296 (see my earlier blog for a review of RMSE if needed, or the short sketch after step 3 below). Controlling for "global effects" brought this same data to an RMSE of 0.9657 (according to the cited article).
In the actual kNN algorithm, residual data is processed. Using the algorithm without removing "global effects" resulted in an RMSE of 0.9536 (according to the cited article). Using kNN after removing "global effects" and matrix factorization results gave an RMSE of 0.9071 (according to the cited article).
I will not cover the removal of "global effects" or matrix factorization in today's blog. For purposes of focusing solely on the kNN algorithm, I will assume that all data is raw (that is, it has not been normalized).
In a future blog, I will cover "global effects" and matrix factorization and try to reproduce the BellKor results cited.
2) neighbor selection
This is where a similarity function is selected and where this similarity function is used to determine the k nearest neighbors. This can be approached in a user-oriented way (a function on similar users) or in an item-oriented way (a function on similar items).
I will go into detail about the formulas used in the article for the similarity function. In a future blog, I will get into code and compare the BellKor approach to other kNN approaches.
3) determination of the interpolation weights
Determination of the interpolation weights is very important since they are tied very closely to the prediction. Bell and Koren compare the results they reached with results from a "correlation-based interpolation."
They found that for raw scores, correlation-based interpolation resulted in an RMSE of 0.9947. In contrast, their approach resulted in an RMSE of 0.9536. After removing "global effects" and matrix factorization results, the correlation-based approach resulted in an RMSE of 0.9142 in comparison to team BellKor's result of 0.9071.
Below I will detail the formulas for the BellKor approach of "jointly derived interpolation."
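Since RMSE appears in every comparison above, here is a minimal sketch of how it is computed, using toy numbers rather than the Netflix probe set; it also shows the "predict everything with the mean" baseline idea behind the 1.1296 figure. This is my own illustration, not BellKor's code.

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Baseline idea: predict every rating with the overall mean (toy data only)
ratings = [5, 3, 4, 1, 2, 4, 5, 3]
mean_rating = sum(ratings) / len(ratings)
print(rmse([mean_rating] * len(ratings), ratings))
```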
Disclaimer
I would also like to make a clear disclaimer. I am not an expert in learning theory nor in the ideas presented in the article. This summary is my best effort to understand the formulas in the article. I would encourage everyone to read the original articles. If you are an expert in learning theory, please review this article and post your comments. I am happy to be corrected on any point that I do not accurately represent.
Neighborhood-based Approach
The Neighborhood-based approach attempts to predict how a user u will rate an item i by looking at previous data. The user-oriented approach looks for users similar to u that have already rated item i. The item-oriented approach looks for items similar to i that user u has already rated. In both cases, the strategy is to find a set of most similar items and then use this set to make the prediction.
One of the first questions that needs to be answered is how many elements should be in such a set. Because of the importance of this question, the neighborhood-based approach is commonly called the "k Nearest Neighbors" (kNN) approach. In the BellKor article, they considered values of k=20, k=35, and k=50. For the raw data, they found the best result with k=20 (RMSE = 0.9536). For the normalized data, they found the same result for all three (RMSE = 0.9071).
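Before getting into the formulas, here is a rough sketch of the item-oriented flow: pick the k items most similar to i among those the user has already rated, then combine their ratings with interpolation weights. The `similarity` and `weights` arguments are stand-ins for the formulas developed below, and the weighted-sum prediction is the standard kNN form; this is my own illustration, not the BellKor implementation.

```python
def predict_rating(user, item, ratings, similarity, weights, k=20):
    """Item-oriented kNN sketch: predict the rating of `item` by `user`.

    ratings[user] is assumed to be a dict {item: rating}; similarity(i, j)
    and weights(user, item, neighbors) are stand-ins for the formulas below.
    """
    rated = [j for j in ratings[user] if j != item]
    # N(i;u): the k items rated by this user that are most similar to `item`
    neighbors = sorted(rated, key=lambda j: similarity(item, j), reverse=True)[:k]
    w = weights(user, item, neighbors)      # one interpolation weight per neighbor
    # Weighted sum of the neighbors' ratings gives the prediction
    return sum(w[j] * ratings[user][j] for j in neighbors)
```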
Notation
As far as notation goes, the article refers to users with the variables u,v and to items with the variables i,j,k.
The set of k Nearest Neighbors is represented as N(u;i) in the case of similar users (the user-oriented approach) or N(i;u) in the case of similar items (the item-oriented approach).
The rating by user u of an item i is represented as rui.
Selection of N(u;i) or N(i;u) is based on a similarity function sij when item i is compared to item j and suv if user u is compared to user v.
U(i,j) represents the set of users who have rated both item i and item j and |U(i,j)| represents the number of users rating both i and j.
I(u,v) represents the set of items that are rated by both user u and user v and |I(u,v)| represents the number of items that are rated by both user u and user v.
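To make the notation concrete, here is one way the sets U(i,j) and I(u,v) could be built from a small ratings table. The nested-dictionary layout and the helper names are my own choices for illustration; nothing here comes from the article.

```python
# Toy ratings: ratings[user][item] = rating
ratings = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i2": 4, "i3": 2},
    "u3": {"i2": 1, "i3": 5},
}

def users_who_rated_both(ratings, i, j):
    """U(i,j): users who rated both item i and item j."""
    return {u for u, r in ratings.items() if i in r and j in r}

def items_rated_by_both(ratings, u, v):
    """I(u,v): items rated by both user u and user v."""
    return set(ratings[u]) & set(ratings[v])

print(users_who_rated_both(ratings, "i1", "i2"))      # {'u1', 'u2'}
print(len(items_rated_by_both(ratings, "u2", "u3")))  # |I(u2,u3)| = 2
```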
Similarity Function and Neighbor Selection
To build N(i;u), we need a similarity function to determine the similarity of another item j to the given item i. In other words, we need sij.
The formula for sij is:
where α is a small value that we can play around with.
To build N(u;i), we need a similarity function to determine the similarity of another user v to the given user u. In other words, we need suv.
The formula for suv is:
where α is a small value that we can play around with.
So, we can now build the set N(u;i), which consists of the k most similar users that have rated item i (based on maximizing suv), and we can build the set N(i;u), which consists of the k most similar items that have been rated by user u (based on maximizing sij).
I am unclear on how α is determined in their article; they simply say "for some small α".
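Since the formula images are not reproduced above, here is a sketch of neighbor selection using a support-shrunk Pearson correlation, which is a common choice for sij in this literature; the article's exact similarity formula may differ, so treat the shrinkage form and the default α below as assumptions on my part.

```python
from math import sqrt

def shrunk_item_similarity(ratings, i, j, alpha=10.0):
    """One plausible form of sij: a Pearson correlation over U(i,j),
    shrunk by the support factor |U(i,j)| / (|U(i,j)| + alpha).

    This is an assumption -- a commonly used shrunk correlation -- since
    the exact formula from the article is not reproduced above.
    """
    common = [u for u, r in ratings.items() if i in r and j in r]   # U(i,j)
    n = len(common)
    if n < 2:
        return 0.0
    xs = [ratings[u][i] for u in common]
    ys = [ratings[u][j] for u in common]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return (n / (n + alpha)) * (cov / (sx * sy))

def k_nearest_items(ratings, u, i, k=20, alpha=10.0):
    """N(i;u): the k items already rated by user u that maximize sij."""
    candidates = [j for j in ratings[u] if j != i]
    candidates.sort(key=lambda j: shrunk_item_similarity(ratings, i, j, alpha),
                    reverse=True)
    return candidates[:k]
```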
Determining Interpolation Weights
Next, we need to determine interpolation weights for the prediction.
If using a user-oriented approach, then each weight is wuv.
If using an item-oriented approach, then each weight is wij.
The goal is ultimately to use one of the following formulas for a prediction:
or
If the data is dense enough, then the interpolation weights can be determined by solving a least squares problem:
Aw = b
which gives us:
w = A⁻¹b
where A is an invertible K x K matrix defined as:
and b ∈ R^K where:
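Under the dense-data assumption, finding the weights is an ordinary linear solve. The sketch below assumes the K x K matrix A and the vector b have already been built according to the article's definitions (not reconstructed here); it simply solves Aw = b, using numpy.linalg.solve rather than forming A⁻¹ explicitly, which is cheaper and numerically safer.

```python
import numpy as np

def interpolation_weights(A, b):
    """Solve A w = b for the K interpolation weights.

    A: K x K matrix, b: length-K vector, built as in the article
    (their entries are not reconstructed here).
    """
    return np.linalg.solve(A, b)

# Toy example with K = 2 (values are illustrative only)
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([2.0, 1.0])
w = interpolation_weights(A, b)   # weights for the K neighbors
```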
Since it is unreasonable to assume that the data is dense enough, one improvement would be to average each sum over its total number of members. In this case:
and:
where U(j,k) is the set of users who rated both item j and item k, and U(i,j) is the set of users who rated both item i and item j.
But this can be quite inaccurate for very small sets U(i,j). To remedy this, the article recommends shrinking toward a baseline value, which they call avg.
We will create two estimates of the above values:
and
In the above estimates, "the parameter β controls the extent of the shrinkage." For example, in the article, they discuss their reasons for using β = 500 with normalized data.
The value of avg is computed based on the following matrix:
where if jk above is a diagonal entry, then avg = the average of diagonal entries of A
and if jk above is a nondiagonal entry, then avg = the average of nondiagonal entries of A
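To tie the shrinkage step together, here is a sketch of how a shrunk estimate of A could be formed from the averaged entries, their supports, β, and the diagonal/nondiagonal averages described above. The blending rule (a support-weighted mix of each entry with avg, controlled by β) is my reading of the recipe in the text, not a verbatim copy of the article's formulas; the same idea would apply to the entries of b.

```python
import numpy as np

def shrink_toward_avg(A_bar, support, beta=500.0):
    """Shrink each averaged entry of A toward a baseline value 'avg'.

    A_bar[j, k]   : averaged entry (the sum over U(j,k) divided by |U(j,k)|)
    support[j, k] : |U(j,k)|, the number of users behind that entry
    avg           : average of the diagonal entries of A_bar for diagonal
                    positions, average of the nondiagonal entries otherwise

    The support-weighted blend below is my reading of the shrinkage
    described in the text, not a verbatim copy of the article's formula.
    """
    K = A_bar.shape[0]
    diag = np.eye(K, dtype=bool)
    avg = np.where(diag, A_bar[diag].mean(), A_bar[~diag].mean())
    return (support * A_bar + beta * avg) / (support + beta)
```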
This then completes the formulas as presented in their article on the Neighborhood Based Approach. In my next blog, I will look into code for implementing the kNN approach and see if I am able to reproduce the RMSE scores reported by the BellKor team.
References
- R. Bell and Y. Koren, "Improved Neighborhood-based Collaborative Filtering", KDD Cup and Workshop, ACM Press, 2007
- Robert M. Bell, Yehuda Koren, and Chris Volinsky, "The BellKor solution to the Netflix Prize"