Sunday, March 2, 2008

BellKor Algorithm: Overview

In their article "The Bellkor solution to the Netflix Prize," the authors state that their algorithm is the blending of "107 individual results."

In the solution outlined, the article presupposes detailed understanding of each of the approaches as well as the linear regression analysis used to blend the results.

They list out the 107 results and categorize the results into the following categories:

1. Asymmetric factor models (9 results)
2. Regression models (15 results)
3. Restricted Boltzmann Machines with Gaussian visible units (8 results)
4. Restricted Boltzmann Machines (8 results)
5. Matrix factorization (30 results)
6. Neighborhood-based model (k-NN) (18 results)
7. Combinations (9 results)
8. Imputation of Qualifying predictions (7 results)
9. Specials (3 results)

Blending is accomplished by viewing the combination of results "as a linear regression problem".

In retrospect, the authors do not believe that all 107 results are necessary. They write:
For completeness, we listed all 107 results that were blended in our RMSE=0.8712 submission. It is important to note that the major reason for using all these results was convenience, as we anyway accumulated them during the year. This is the nature of an ongoing competition. However, in hindsight, we would probably drop many of these results and recompute some in a different way. We believe that far fewer results are really necessary.

They then list out some of the factors out of the 107 that have the most impact:
  • A blend of 3 results leads to an RMSE of 0.8793
  • A blend of 11 results leads to an 8% improvement over the Cinematch score.
Their recommended approach to the Netflix Challenge is the following:
Predictive accuracy is substantially improved when blending mutliple predictors. Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique. (italics in original)

In my next blog, I will go into more detail on the Neighborhood-based model (k-NN).


No comments: