<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8592569895405724581</id><updated>2011-11-27T16:16:53.296-08:00</updated><title type='text'>Algorithm Analysis</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://algorithmsanalyzed.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://algorithmsanalyzed.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>8</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8592569895405724581.post-8249983148533477773</id><published>2008-07-22T17:05:00.000-07:00</published><updated>2008-12-10T01:47:00.955-08:00</updated><title type='text'>BellKor Algorithm: Pearson Correlation Coefficient</title><content type='html'>&lt;span style="font-weight: bold;font-size:130%;" &gt;Introduction&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In their &lt;a href="http://public.research.att.com/%7Eyehuda/pubs/cf-workshop.pdf"&gt;paper&lt;/a&gt; on the k-nearest neighbor (knn) algorithm, the BellKor team mention three choices that they considered for a similarity function: Pearson's Correlation Coefficient, the Cosine Coefficient, and the Root Mean Square Difference.&lt;br /&gt;&lt;br /&gt;In today's blog, I will focus solely on the Pearson Correlation Coefficient.  It is clear from the paper that they did a lot of work with Pearson.  There is also a good explanation about it in the book &lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;amp;location=http%3A%2F%2Fwww.amazon.com%2FProgramming-Collective-Intelligence-Building-Applications%2Fdp%2F0596529325%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1217125608%26sr%3D8-1&amp;amp;tag=themovieadvis-20&amp;amp;linkCode=ur2&amp;amp;camp=1789&amp;amp;creative=9325"&gt;&lt;span style="font-style: italic;"&gt;Collective Intelligence&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In line with the BellKor article, I will focus on the item-item correlation.  Newman! and Tillberg have reported that they can generate this statistic in 20 minutes on a single processor and others have reported generating it in 5 minutes on a quad-core processor.  The algorithm that I describe in today's blog took 2 hours to run and required 1GB RAM on my machine.  It look 45 minutes to load up into mysql.  I wrote the algorithm in java.  I suspect that a C/C++ algorithm can go faster and the code is not well optimized.  My goal here is to present the algorithm in the clearest possible way.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;MySQL Tables&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It is assumed that a Rating table already exists in mysql that includes all the Netflix training data.  In a &lt;a href="http://setupandconfig.blogspot.com/2008/04/loading-up-netflix-prize-data-into.html"&gt;previous blog&lt;/a&gt;, I went into detail on how I loaded this table from the Netflix data and removed the probe data.  Here is the table:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Rating Table&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;movieid&lt;/span&gt;:  int(2)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;userid&lt;/span&gt;:  int(3)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;rating&lt;/span&gt;:  int(1)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;date&lt;/span&gt;: date&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;first&lt;/span&gt;: date&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;support&lt;/span&gt;: int(3)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;avg&lt;/span&gt;: double&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;avg2&lt;/span&gt;: double&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;residual&lt;/span&gt;: double&lt;/li&gt;&lt;/ul&gt;In addition to this table, I created a table which I called Pearson.   Here is the sql for creating this table:&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;CREATE TABLE pearson (movie1 int(2), movie2 int(2), num int(3), pearson float,&lt;br /&gt;index index_pearson_movie1(movie1),&lt;br /&gt;index index_pearson_pearson(pearson));&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So, we end up with the following table:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Pearson Table&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;movie1&lt;/span&gt;:  int(2)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;movie2&lt;/span&gt;:  int(2)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;num&lt;/span&gt;: int(3)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;pearson&lt;/span&gt;: float&lt;/li&gt;&lt;/ul&gt;Now, the tricky part is generating the pearson correlation coefficient for all 17,770 tables.  That's a set of 17,770 x 17,770 = over 315 million records.  On a very powerful machine, that may not be a problem but one a 1GB machine, if you do not code the algorithm correctly, the processing can easily take more than 2 hours.&lt;br /&gt;&lt;br /&gt;But before jumping into the algorithm, let's take a look at the Pearson formula.  One of the important optimizations comes down to simplifying the standard formula.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Pearson Formula&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here's the formula for computing the coefficient for two items x and y is the following:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SIu6qYUeuBI/AAAAAAAABGU/N5aZr92NArE/s1600-h/pearson.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SIu6qYUeuBI/AAAAAAAABGU/N5aZr92NArE/s400/pearson.png" alt="" id="BLOGGER_PHOTO_ID_5227477029739214866" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Following the cue from the book Collective Intelligence by O'Reilly, it gets much simpler, if we decompose it into the following 5 variables:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SIu7v3z1IcI/AAAAAAAABGc/n3tVCN4wF5s/s1600-h/pearson2.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SIu7v3z1IcI/AAAAAAAABGc/n3tVCN4wF5s/s400/pearson2.png" alt="" id="BLOGGER_PHOTO_ID_5227478223603179970" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SIu724F0aSI/AAAAAAAABGk/2m5v-D8U08I/s1600-h/pearson3.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SIu724F0aSI/AAAAAAAABGk/2m5v-D8U08I/s400/pearson3.png" alt="" id="BLOGGER_PHOTO_ID_5227478343937714466" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/SIu7-uR8SxI/AAAAAAAABGs/t4jeUPYLoBM/s1600-h/pearson4.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/SIu7-uR8SxI/AAAAAAAABGs/t4jeUPYLoBM/s400/pearson4.png" alt="" id="BLOGGER_PHOTO_ID_5227478478743161618" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SIu8GF2893I/AAAAAAAABG0/eSuwMwO2Baw/s1600-h/pearson5.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SIu8GF2893I/AAAAAAAABG0/eSuwMwO2Baw/s400/pearson5.png" alt="" id="BLOGGER_PHOTO_ID_5227478605331494770" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_wqJeUjTB5sE/SIu8QKudzCI/AAAAAAAABG8/iI_XOTyLDNM/s1600-h/pearson6.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_wqJeUjTB5sE/SIu8QKudzCI/AAAAAAAABG8/iI_XOTyLDNM/s400/pearson6.png" alt="" id="BLOGGER_PHOTO_ID_5227478778436766754" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Then, the formula becomes:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SIu9AfB_5KI/AAAAAAAABHE/WAFXiH_L6DI/s1600-h/pearson7.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SIu9AfB_5KI/AAAAAAAABHE/WAFXiH_L6DI/s400/pearson7.png" alt="" id="BLOGGER_PHOTO_ID_5227479608521122978" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The result will be between -1 and 1 where the closer it is to 1, the closer the match.  A movie compared to itself will score a 1.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Loading up the Ratings table into memory&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In order to calculate the pearson coefficient quickly, we need to load up the Rating table into memory and we need to have it sorted by both movieid and userid.&lt;br /&gt;&lt;br /&gt;Here is what &lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=839"&gt;Dan Tillberg writes&lt;/a&gt; in the Netflix Prize forum:&lt;br /&gt;&lt;blockquote&gt;...the trick is to have trainer ratings sorted by both movie and by user.  Then, to find movie-movie similarities for, say, movie m1, you look at all of the *users* that rated movie m1, and then you go through each of those users' *movies* that they've rated to fill in the entire m1'th row of the movie-movie similarity matrix (actually, you only need to bother with movies m2 &lt;= m1, since the other half is symmetrical, as noted further below).  I do this in three minutes on an quad-core processor.  If you calculate it cell by cell instead of row by row it takes a lot longer.&lt;/blockquote&gt;I'll talk about the computation of the coefficient in the next section.  On my 1GB machine, I needed to run java with the following options:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;java -Xms950m -Xmx950m pearson &gt; pearson.txt&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;So, the goal is to output the result to a file pearson.txt which I will later load up into MySQL.&lt;br /&gt;&lt;br /&gt;To speed things up, I don't fully sort the ratings by userid and the movieid.  Instead, I group all the ratings by userid and then I group all the ratings by movieid.  This makes the load up into memory go significantly faster.&lt;br /&gt;&lt;br /&gt;Here is the arrays that I used for the load up:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;private static final int NUM_RECORDS=99072112;&lt;br /&gt;&lt;br /&gt;byte[] ratingByUser = new byte[NUM_RECORDS];&lt;br /&gt;short[] movieByUser = new short[NUM_RECORDS];&lt;br /&gt;&lt;br /&gt;byte[] ratingByMovie = new byte[NUM_RECORDS];&lt;br /&gt;int[] userByMovie = new int[NUM_RECORDS];&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;To group each rating, I use the following arrays:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;private static final int NUM_USERS=480189;&lt;br /&gt;private static final int NUM_MOVIES=17770;&lt;br /&gt;&lt;br /&gt;int[] userIndex = new int[NUM_USERS+1];&lt;br /&gt;int[] userNextPlace = new int[NUM_USERS];&lt;br /&gt;int[] movieIndex = new int[NUM_MOVIES+1];&lt;br /&gt;int[] movieNextPlace = new int[NUM_MOVIES];&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Some Help from MySQL&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To speed up the processing, I create two tables in MySQL that will provide me with the number of movies by user and the number of users by movie.  These will be relatively small tables and that will help in the grouping process.  I will use the term "support" in the same way BellKor where "movie support" is the number of users who rated each movie and "user support" is the number of movies rated each user.&lt;br /&gt;&lt;br /&gt;Here are the two tables:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Movie_Support&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;movieid&lt;/span&gt;: int(2)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;support&lt;/span&gt;: int(3)&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;User_Support&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;userid&lt;/span&gt;: int(3)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;support&lt;/span&gt;: int(3)&lt;/li&gt;&lt;/ul&gt;Here's the sql for creating and loading up these tables:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;CREATE TABLE movie_support (movieid INT(2), support INT(3));&lt;br /&gt;&lt;br /&gt;CREATE TABLE user_support (userid INT(3), support INT(3));&lt;br /&gt;&lt;br /&gt;INSERT INTO movie_support (movieid,support) SELECT movieid, COUNT(rating) FROM rating GROUP BY movieid;&lt;br /&gt;&lt;br /&gt;INSERT INTO user_support (userid,support) SELECT userid, COUNT(rating) FROM rating GROUP BY userid;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;We don't need any indexes for either of these tables.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Handling UserId&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;UserId presents an opportunity for reducing storage space.  Now, the userid data presented as part of training varies from 6 to 2,649,429.  But, there are only 480,189 distinct users.&lt;br /&gt;&lt;br /&gt;To address this point, I use a simple hash table to map a relative userid (1 .. 480,189) to the absolute userid (6 .. 2,649,429).  Here's the code to do this:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;private static final int nextRelId=0;&lt;br /&gt;private static HashMap&amp;lt;integer,integer&amp;gt; userMap = new HashMap&amp;lt;integer,integer&amp;gt;();&lt;br /&gt;&lt;br /&gt;private static int getRelUserId(int userId) {&lt;br /&gt;Integer relUserId = userMap.get(userId);&lt;br /&gt;if (relUserId == null) {&lt;br /&gt;userMap.put(userId,nextRelUserId);&lt;br /&gt;relUserId = nextRelId++;&lt;br /&gt;}&lt;br /&gt;return relUserId;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Before Grouping Ratings by Movie and User&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To prepare for the load up all the ratings and group them by userid and movieid, I used the following java code:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;userIndex[NUM_USERS] = NUM_RECORDS;&lt;br /&gt;int i=0;&lt;br /&gt;ResultSet rs = stmt.executeQuery("SELECT userid,support FROM user_support");&lt;br /&gt;while (rs.next()) {&lt;br /&gt;int relUserId = getRelUserId(rs.getInt("userid"));&lt;br /&gt;userIndex[relUserId] = i;&lt;br /&gt;userNextPlace[relUserId] = i;&lt;br /&gt;i += rs.getInt("support");&lt;br /&gt;}&lt;br /&gt;rs.close();&lt;br /&gt;&lt;br /&gt;movieIndex[NUM_MOVIES] = NUM_RECORDS;&lt;br /&gt;i = 0;&lt;br /&gt;rs = stmt.executeQuery("SELECT movied,support FROM movie_support");&lt;br /&gt;while (rs.next()) {&lt;br /&gt;int relMovieId = rs.getInt("movieid")-1;&lt;br /&gt;movieIndex[relMovieId]=i;&lt;br /&gt;movieNextPlace[relMovieId]=i;&lt;br /&gt;i += rs.getInt("support");&lt;br /&gt;}&lt;br /&gt;rs.close();&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Once, I have these indices set, I am ready to start loading up.  The advantage of these arrays will now become apparent.  We do a very fast single pass through the entire Rating table.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Grouping the Ratings by User and Movie&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here's the java code for loading up the ratings:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;rs = stmt.executeQuery("SELECT userid,movieid,rating FROM rating");&lt;br /&gt;while (rs.next()) {&lt;br /&gt;int userId = rs.getInt("userid");&lt;br /&gt;int relUserId = getRelUserId(userid);&lt;br /&gt;short movieId = rs.getShort("movieid");&lt;br /&gt;int relMovieId = movieId-1;&lt;br /&gt;byte rating = rs.getByte("rating");&lt;br /&gt;&lt;br /&gt;movieByUser[userNextPlace[relUserId]] = movieId;&lt;br /&gt;ratingByUser[userNextPlace[relUserId]] = rating;&lt;br /&gt;userNextPlace[relUserId]++;&lt;br /&gt;&lt;br /&gt;userByMovie[movieNextPlace[relMovieId]] = userId;&lt;br /&gt;ratingByMovie[movieNextPlace[relMovieId]] = rating;&lt;br /&gt;movieNextPlace[relMovieId]++;&lt;br /&gt;}&lt;br /&gt;rs.close();&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Quickly Tallying the Ratings&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Once we've loaded up ratings into memory, we are ready to process them.  To do this efficiently, I use a technique that &lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=928"&gt;Newman! posted on the Netflix Prize forum&lt;/a&gt;.  Here's Newman's description:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;If you're using raw score values (1 to 5), you can speed things up significantly by replacing the PearsonIntermediate structure with:&lt;br /&gt;&lt;br /&gt;unsigned int32 viewerCnt[5][5];&lt;br /&gt;&lt;br /&gt;where viewerCnt[m][n] is the number of viewers who gave movie X a score of m+1 and movie Y a score of n+1. Furthermore, based on the total number of viewers of a movie, you can replace int32 with int16 or int8, speeding it up even further.&lt;br /&gt;&lt;br /&gt;On my 1.5G laptop, it took 20 minutes to calculate 17770x17770 correlations and write out a big data file. Of course, you really need to calculate only 17770x17770/2, if you have enough memory to hold all correlation and related values (like average rating from common viewers) in memory, otherwise disk seeking will kill your running time and hard disk when you read the data file later. On a multi-core processor, you can easily parallelize the calculations. So I think on a fast multi-core system, you can calculate all movie correlations in less than 5 minutes.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;So, here's my code:&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;int[][][] values = new int[NUM_MOVIES][5][5];&lt;br /&gt;&lt;br /&gt;for (i=0; i &amp;lt; NUM_MOVIES-1; i++) {&lt;br /&gt;for (int j=i+1; j &amp;lt; NUM_MOVIES; j++) {&lt;br /&gt;for (int k=0; k &amp;lt; 5; k++) {&lt;br /&gt;for (int l=0; l &amp;lt; 5; l++) {&lt;br /&gt;values[j][k][l]=0;&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;for (int j=movieIndex[i]; j &amp;lt; movieIndex[i+1]; j++) {&lt;br /&gt;int relUserId = getRelUserId(userByMovie[j]);&lt;br /&gt;for (int k=userIndex[relUserId]; k &amp;lt; userIndex[relUserId+1]; j++) {&lt;br /&gt;if (movieByUser[k]-1 &amp;gt; i) {&lt;br /&gt;values[movieByUser[k]-1][ratingByUser[k]-1][ratingByMovie[j]-1]++;&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Shrinking the Pearson Coefficient in relation to data available&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;BellKor state that they shrink the pearson coefficient based on the the count of data available.   They use the following formula in the article:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SIvRiDM_sKI/AAAAAAAABHM/YikXX8xhRrk/s1600-h/pearson9.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SIvRiDM_sKI/AAAAAAAABHM/YikXX8xhRrk/s400/pearson9.png" alt="" id="BLOGGER_PHOTO_ID_5227502175399162018" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Where |U(i,j)| is "the set of users hwo rated both items j and k".&lt;br /&gt;&lt;br /&gt;In the original article, they state that they used "some small α."  In the &lt;a href="http://www.netflixprize.com//community/viewtopic.php?id=808"&gt;NetPrize Forum&lt;/a&gt;, Yehuda Koren thinks that they "used alpha values around 5-10." In my code, I use α = 10.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Computing the Pearson&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;With this tally, I am now able to compute the Pearson coefficient for each movie in comparison to i.&lt;br /&gt;&lt;br /&gt;Here's the code:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;for (i=0; i &amp;lt; NUM_MOVIES-1; i++) {&lt;br /&gt;// Insert the code from above for tallying the totals...&lt;br /&gt;//...&lt;br /&gt;for (int j=i+1; j &amp;lt; NUM_MOVIES; j++) {&lt;br /&gt;float sum1=0;&lt;br /&gt;float sum2=0;&lt;br /&gt;float sumsq1=0;&lt;br /&gt;float sumsq2=0;&lt;br /&gt;float sumpr=0;&lt;br /&gt;float num=0;&lt;br /&gt;for (int k=1; k &amp;lt;= 5; k++) {&lt;br /&gt;for (int l=1; k &amp;lt;= 5; l++) {&lt;br /&gt;int val=values[j][k-1][l-1];&lt;br /&gt;sum1 += l*val;&lt;br /&gt;sum2 += k*val;&lt;br /&gt;sumsq1 = l*l*val;&lt;br /&gt;sumsq2 = k*k*val;&lt;br /&gt;sumpr = k*l*val;&lt;br /&gt;num += val;&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;float top = sumpr - (sum1*sum2/num);&lt;br /&gt;float bottom = (float)Math.sqrt((sumsq1 - (sum1*sum1)/num)*(sumsq2 - (sum2*sum2)/num));&lt;br /&gt;if (bottom != 0) {&lt;br /&gt;float pearson = (top/bottom)*(num/(num+10));&lt;br /&gt;System.out.print((i+1)+","+(j+1)+","+num+","+pearson+"\n");&lt;br /&gt;System.out.print((j+1)+","+(i+1)+","+num+","+pearson+"\n");&lt;br /&gt;}&lt;br /&gt;else {&lt;br /&gt;System.out.print((i+1)+","+(j+1)+","+num+","+0+"\n");&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Loading Pearson Coefficients into MySQL&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here's the sql that I used to load back up into MySQL&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;use netflix;&lt;br /&gt;&lt;br /&gt;ALTER TABLE pearson DISABLE KEYS;&lt;br /&gt;&lt;br /&gt;LOAD DATA LOCAL INFILE 'pearson.txt' INTO TABLE pearson&lt;br /&gt;FIELDS TERMINATED BY ','&lt;br /&gt;LINES TERMINATED BY '\n'&lt;br /&gt;(movie1,movie2,num,pearson);&lt;br /&gt;&lt;br /&gt;ALTER TABLE pearson ENABLE KEYS;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;That's pretty much it.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Using Your Pearson Table&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For example to compute 10 movies that are most like &lt;a href="http://www.netflix.com/Movie/The_Adventures_of_Robin_Hood/60010997?lnkctr=srchrd-sr&amp;amp;strkid=110560919_0_0"&gt;The Adventures of Robin Hood&lt;/a&gt; (movieid 16840), we do the following:&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;SELECT movie1,pearson FROM pearson WHERE movie2=7064 ORDER BY pearson DESC LIMIT 10;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now, I ran my pearson against the entire training data so your results may be different.&lt;br /&gt;&lt;br /&gt;Here are my results:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The Adventures of Robin Hood: Bonus Material (movieid: 3032, pearson: 0.64)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflix.com/Movie/Fairly_Oddparents_School_s_Out_The_Musical/70032389?lnkctr=srchrd-sr&amp;amp;strkid=358094168_9_0"&gt;Fairly Odd Parents: School's Out! The Musical&lt;/a&gt; (movieid: 15618, pearson: 0.6234)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflix.com/Movie/Imelda/60035200?lnkctr=srchrd-sr&amp;amp;strkid=2075622327_0_0"&gt;Imelda&lt;/a&gt; (movieid: 15948, pearson: 0.6232)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflix.com/Movie/Tiger_Bay/1047715?lnkctr=srchrd-sr&amp;amp;strkid=1374704112_0_0"&gt;Tiger Bay&lt;/a&gt; (movieid: 815, pearson: 0.61592)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflix.com/Movie/The_Adventures_of_Ociee_Nash/70036131?lnkctr=srchrd-sr&amp;amp;strkid=1877307365_0_0"&gt;The Adventures of Ociee Nash&lt;/a&gt; (movieid: 12979, pearson: 0.61591)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflix.com/Movie/Counsellor_at_Law/60026541?lnkctr=srchrd-sr&amp;amp;strkid=242421363_0_0"&gt;Counsellor at Law&lt;/a&gt; (movieid: 9122, pearson: 0.61)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflix.com/Movie/Empires_Peter_and_Paul_and_the_Christian_Revolution/60031677?lnkctr=srchrd-sr&amp;amp;strkid=1196902231_0_0"&gt;Empires: Peter and Paul and the Christian Revolution&lt;/a&gt; (movieid: 9574, pearson: 0.60)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflix.com/Movie/Hillary_Tenzing_Climbing_to_the_Roof_of_World/60027456?lnkctr=srchrd-sr&amp;amp;strkid=2071886718_0_0"&gt;Hillary &amp;amp; Tenzing: Climbing to the Roof of World&lt;/a&gt; (movieid: 735, pearson: 0.599)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflix.com/Movie/Faerie_Tale_Theatre_Snow_White_and_the_Seven_Dwarfs/70023633?lnkctr=srchrd-sr&amp;amp;strkid=634070475_0_0"&gt;Faerie Tale Theater: Snow White and the Seven Dwarfs&lt;/a&gt; (movieid: 3358, pearson: 0.596)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflix.com/Movie/Stephen_Sondheim_s_Putting_It_Together/70002485?lnkctr=srchrd-sr&amp;amp;strkid=2010609623_0_0"&gt;Stephen Sondheim's Putting It Together&lt;/a&gt; (movieid: 5868, pearson: 0.595)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;br /&gt;References&lt;br /&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://public.research.att.com/%7Eyehuda/pubs/cf-workshop.pdf"&gt;"Improved Neighborhood-based Collaborative Filtering"&lt;/a&gt;,Robert M. Bell and Yehuda Koren, AT&amp;amp;T Labs – Research&lt;/li&gt;&lt;li&gt;Toby Segaran, &lt;span style="font-style: italic;"&gt;&lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;amp;location=http%3A%2F%2Fwww.amazon.com%2FProgramming-Collective-Intelligence-Building-Applications%2Fdp%2F0596529325%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1217125608%26sr%3D8-1&amp;amp;tag=themovieadvis-20&amp;amp;linkCode=ur2&amp;amp;camp=1789&amp;amp;creative=9325"&gt;Collective Intelligence&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=themovieadvis-20&amp;amp;l=ur2&amp;amp;o=1" alt="" style="border: medium none  ! important; margin: 0px ! important;" width="1" border="0" height="1" /&gt;&lt;/span&gt;, O'Reilly, 2007.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflixprize.com//community/viewtopic.php?id=808"&gt;"BellKor Similarity"&lt;/a&gt;, Netflix Prize Forum, Yehuda Koren, Nov 27, 2007.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=928"&gt;"Fast way to computer correlations?"&lt;/a&gt;, Netflix Prize Forum, Newman!, April 23, 2008&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=839"&gt;"Calculating similarities between movies by Korbell's Paper"&lt;/a&gt;,Netflix Prize Forum, Dan Tillberg, January 15, 2008.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8592569895405724581-8249983148533477773?l=algorithmsanalyzed.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://algorithmsanalyzed.blogspot.com/feeds/8249983148533477773/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8592569895405724581&amp;postID=8249983148533477773' title='29 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/8249983148533477773'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/8249983148533477773'/><link rel='alternate' type='text/html' href='http://algorithmsanalyzed.blogspot.com/2008/07/bellkor-algorithm-pearson-correlation.html' title='BellKor Algorithm: Pearson Correlation Coefficient'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_wqJeUjTB5sE/SIu6qYUeuBI/AAAAAAAABGU/N5aZr92NArE/s72-c/pearson.png' height='72' width='72'/><thr:total>29</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8592569895405724581.post-8788475394061075914</id><published>2008-05-27T00:37:00.001-07:00</published><updated>2009-11-17T00:13:48.715-08:00</updated><title type='text'>BellKor Algorithm: Global Effects</title><content type='html'>&lt;span style="font-weight: bold;font-size:130%;" &gt;Introduction&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This article refers to the &lt;a href="http://algorithmsanalyzed.blogspot.com/2008/03/bellkor-algorithm-overview.html"&gt;BellKor Algorithm&lt;/a&gt; which won the &lt;a href="http://algorithmsanalyzed.blogspot.com/2008/02/50000-algorithm.html"&gt;$50,000 Progress Prize&lt;/a&gt; as part of the &lt;a href="http://algorithmsanalyzed.blogspot.com/2008/02/netflix-prize-1-million-challenge.html"&gt;Netflix Prize contest&lt;/a&gt;.  The purpose of this article is to provide details on the winning algorithm.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://algorithmsanalyzed.blogspot.com/2008/03/bellkor-algorithm-overview.html"&gt;BellKor team&lt;/a&gt; found that removing Global Effects greatly improved the results of applying the&lt;a href="http://algorithmsanalyzed.blogspot.com/2008/03/bellkor-algorithm-neighborhood-based_06.html"&gt; K Nearest Neighbor (knn) Algorithm&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In today's blog, I will use MySQL to remove Global Effects in the effort to reproduce the results of the BellKor paper.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Disclaimer   &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I am not an expert in learning theory or statistics.  The work done here is my best effort with some support from the &lt;a href="http://www.netflixprize.com//community/"&gt;Netflix Prize forum&lt;/a&gt;. While I was able to reproduce each of the effects that the BellKor identified, I did not get an exact match on some of the numbers. This indicates that the details provided here may not be exactly the same as done by BellKor.&lt;br /&gt;&lt;br /&gt;I am very glad to be corrected by anyone who is able to get an exact match for the BellKor results or is an expert of the field.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;MySQL&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I chose MySQL because my goal here is to understand the logic while minimizing the amount of coding involved. While using memory mapped files with C/C++ is probably the fastest way to process data on Windows, I found that MySQL with indexed tables and occasional help from java/jdbc was fine for me.&lt;br /&gt;&lt;br /&gt;For those interested in the steps I followed to load up the Netflix Prize Data into MySQL and remove the probe data, you can see the details &lt;a href="http://setupandconfig.blogspot.com/2008/04/loading-up-netflix-prize-data-into.html"&gt;here&lt;/a&gt;.  The set up described there took me roughly 3 hours to complete.&lt;br /&gt;&lt;br /&gt;If anyone has suggestions for reducing the time it takes or improving the calculations, please feel free to post it here.&lt;br /&gt;&lt;br /&gt;After following this &lt;a href="http://setupandconfig.blogspot.com/2008/04/loading-up-netflix-prize-data-into.html"&gt;process&lt;/a&gt;, I have the following two tables:&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Rating Table&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;movieid&lt;/span&gt;:  int(2)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;userid&lt;/span&gt;:  int(3)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;rating&lt;/span&gt;:  int(1)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;date&lt;/span&gt;: date&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;first&lt;/span&gt;: date&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;support&lt;/span&gt;: int(3)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;avg&lt;/span&gt;: double&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;avg2&lt;/span&gt;: double&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;residual&lt;/span&gt;: double&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The rating table has &lt;span style="font-weight: bold;"&gt;99,072,112&lt;/span&gt; ratings which do not include the probe table data.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Probe Table&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;movieid&lt;/span&gt;: int(2)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;userid&lt;/span&gt;:  int(3)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;rating&lt;/span&gt;:  int(1)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;date&lt;/span&gt;:  date&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;first&lt;/span&gt;: date&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;support&lt;/span&gt;: int(3)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;avg&lt;/span&gt;: double&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;avg2&lt;/span&gt;: double&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;prediction&lt;/span&gt;: double&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The probe table has &lt;span style="font-weight: bold;"&gt;1,408,395&lt;/span&gt; ratings.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Global Effects &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Some users are easier on movies when they select their rating. If they like or hate a movie, they rate it a bit higher. Some uses are tougher in their ratings and they rate movies just a little bit lower. Both of these tendencies are examples of a "global effect."  In this case, what the BellKor team call a "user effect".  Some movies get higher ratings when they are first released than after they've been out for a while.    This is an example of a "movie x time (movie) effect."&lt;br /&gt;&lt;br /&gt;In their paper, the BellKor team present a strategy of removing a global effect "one 'effect' at a time." In this way, it is possible to test for "effects", identify them, and then remove them. After each removal, the RMSE goes down (see &lt;a href="http://algorithmsanalyzed.blogspot.com/2008/02/netflix-prize-1-million-challenge.html"&gt;here&lt;/a&gt; for explanation about RMSE if needed).&lt;br /&gt;&lt;br /&gt;The idea of "global effects" is very enticing.  You identify an effect, you test for it, and then you remove it for the total.  As long as you keep finding effects, in theory, the closer you get to the goal of an &lt;span style="font-weight: bold;"&gt;RMSE=0.85&lt;/span&gt;.  In practice, the BellKor team found that global effects alone only got them to an &lt;span style="font-weight: bold;"&gt;RMSE=0.96&lt;/span&gt;.  In this paper, I will go through each of the 11 global effects that the BellKor team identified in their &lt;a href="http://public.research.att.com/%7Eyehuda/pubs/cf-workshop.pdf"&gt;paper&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Residuals&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Since the goal is to remove each "effect" one at a time, the plan is not to focus on the ratings themselves but rather to focus on the "residuals" which are the ratings minus an effect.&lt;br /&gt;&lt;br /&gt;To make this point as clear as possible, I will refer to &lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt; as the rating of user &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt; on item &lt;span style="font-weight: bold;"&gt;i&lt;/span&gt;.  I will refer to each residual as &lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(n)&lt;/span&gt; where &lt;span style="font-weight: bold;"&gt;n&lt;/span&gt; refers to the number of effects that have been removed.&lt;br /&gt;&lt;br /&gt;For example:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(0) = r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(1) = r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(0) - effect&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;1&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(2) = r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(1) - effect&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;2&lt;/sub&gt;&lt;br /&gt;&lt;br /&gt;and so on until we have:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(n) = r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(n-1) - effect&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;n&lt;/sub&gt;.&lt;br /&gt;&lt;br /&gt;The model for the effect is:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;effect&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;n&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = θ*x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;br /&gt;&lt;br /&gt;where:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;x&lt;sub&gt;ui&lt;/sub&gt; =&lt;/span&gt; variable of interest&lt;br /&gt;&lt;br /&gt;and&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;θ = effect&lt;sub&gt;n&lt;/sub&gt;/x&lt;sub&gt;ui&lt;/sub&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Variable of Interest&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;BellKor distinguishes between three types of effects:&lt;br /&gt;&lt;br /&gt;(1)  Main Effects&lt;br /&gt;&lt;br /&gt;In this case,  &lt;span style="font-weight: bold;"&gt;x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(2)  User Effects&lt;br /&gt;&lt;br /&gt;In this case,  &lt;span style="font-weight: bold;"&gt;x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; - avg(x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;for a given&lt;span style="font-weight: bold;"&gt; u)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(3)  Item Effects&lt;br /&gt;&lt;br /&gt;In this case, &lt;span style="font-weight: bold;"&gt;x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; - avg(x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt; for a given &lt;span style="font-weight: bold;"&gt;i)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;They succinctly call this out by writing:&lt;br /&gt;&lt;blockquote&gt;For other global effects, we center &lt;span style="font-weight: bold;"&gt;x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt; for each user by subtracting the mean of &lt;span style="font-weight: bold;"&gt;x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt; for that user.&lt;/blockquote&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Estimating θ &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To estimate &lt;span style="font-weight: bold;"&gt;θ&lt;/span&gt;, the BellKor paper starts by looking at the idea of an unbiased estimator.&lt;br /&gt;&lt;br /&gt;An estimator is a function on sample variables that "estimates" a population variable. An "unbiased estimator" is one where if the sample size is large enough, the estimator is equal to the population variable estimated.&lt;br /&gt;&lt;br /&gt;BellKor proposes the following function for estimating the "effect" using an unbiased estimator:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SBywB8oPiTI/AAAAAAAAA9M/0QlsiDyERic/s1600-h/unbiasedestimator.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SBywB8oPiTI/AAAAAAAAA9M/0QlsiDyERic/s400/unbiasedestimator.png" alt="" id="BLOGGER_PHOTO_ID_5196221617579985202" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Note: you can tell we are using an "estimator" because of the "hat" atop &lt;span style="font-weight: bold;"&gt;θ&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The problem with this approach is that in many cases, the sample size is not very large for a given user and there is always the chance that the sample is not large enough for a given item.&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Using a Posterior Mean&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To compensate for this, BellKor use the concept of a posterior mean. This can be derived by making assumptions about the normal distribution.&lt;br /&gt;&lt;br /&gt;If we assume that for &lt;span style="font-weight: bold;"&gt;θ&lt;/span&gt;, the population mean is &lt;span style="font-weight: bold;"&gt;μ&lt;/span&gt; and the population variance is &lt;span style="font-weight: bold;"&gt;τ&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;2&lt;/sup&gt;, then &lt;span style="font-weight: bold;"&gt;θ&lt;/span&gt; can be characterized:&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;θ ∼ N(μ,τ&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;2&lt;/sup&gt;&lt;span style="font-weight: bold;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Further, using the sample variance &lt;span style="font-weight: bold;"&gt;σ&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;2&lt;/sup&gt;, the BellKor paper notes:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/SBy7XsoPiUI/AAAAAAAAA9U/oHyAL7wuXAU/s1600-h/posteriormean1.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/SBy7XsoPiUI/AAAAAAAAA9U/oHyAL7wuXAU/s400/posteriormean1.png" alt="" id="BLOGGER_PHOTO_ID_5196234085870045506" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Using these two observations, they now restate the estimator for &lt;span style="font-weight: bold;"&gt;θ&lt;/span&gt; as its posterior mean:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SBy8iMoPiVI/AAAAAAAAA9c/KzHXTS4G-Ow/s1600-h/posteriormean2.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SBy8iMoPiVI/AAAAAAAAA9c/KzHXTS4G-Ow/s400/posteriormean2.png" alt="" id="BLOGGER_PHOTO_ID_5196235365770299730" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt; &lt;span style="font-weight: bold;"&gt;Assumptions on μ and σ&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;2&lt;/sup&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Now, since all parameters are centered (that is, &lt;span style="font-weight: bold;"&gt;x = x - avg(x)&lt;/span&gt;), we can assume that the population mean is &lt;span style="font-weight: bold;"&gt;0&lt;/span&gt;.  That is &lt;span style="font-weight: bold;"&gt;μ = 0&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;BellKor further assume that &lt;span style="font-weight: bold;"&gt;σ&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;2&lt;/sup&gt; is proportional ot &lt;span style="font-weight: bold;"&gt;1/count(r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;)&lt;/span&gt;.  If we assume that &lt;span style="font-weight: bold;"&gt;n = count(r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;)&lt;/span&gt; then there exists a &lt;span style="font-weight: bold;"&gt;k&lt;/span&gt; such that&lt;span style="font-weight: bold;"&gt; σ&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;2&lt;/sup&gt;&lt;span style="font-weight: bold;"&gt; = k/n&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Then, the revised formula is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/SBy-YsoPiWI/AAAAAAAAA9k/jpxEsuT05rE/s1600-h/posteriormean3.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/SBy-YsoPiWI/AAAAAAAAA9k/jpxEsuT05rE/s400/posteriormean3.png" alt="" id="BLOGGER_PHOTO_ID_5196237401584798050" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;BellKor now states a very complicated formula for estimating &lt;span style="font-weight: bold;"&gt;τ&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;2&lt;/sup&gt;:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SBy_lMoPiXI/AAAAAAAAA9s/VRCt9wxadP0/s1600-h/posteriormean4.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SBy_lMoPiXI/AAAAAAAAA9s/VRCt9wxadP0/s400/posteriormean4.png" alt="" id="BLOGGER_PHOTO_ID_5196238715844790642" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Ouch.  Well, let's simplify it using the above assumptions.  This gets us to:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SBzAUMoPiYI/AAAAAAAAA90/LAz4psQFCO0/s1600-h/posteriormean5.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SBzAUMoPiYI/AAAAAAAAA90/LAz4psQFCO0/s400/posteriormean5.png" alt="" id="BLOGGER_PHOTO_ID_5196239523298642306" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Better.  But then, BellKor do a "magic" step which is not clear to me at this point.  They say that&lt;br /&gt;it is possible to greatly simplify the above analysis into the following formula:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;effect =&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SBzCg8oPiZI/AAAAAAAAA98/mF-RXGqFrqU/s1600-h/posteriormean6.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SBzCg8oPiZI/AAAAAAAAA98/mF-RXGqFrqU/s400/posteriormean6.png" alt="" id="BLOGGER_PHOTO_ID_5196241941365229970" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where &lt;span style="font-weight: bold;"&gt;α&lt;/span&gt; is a value calculated through cross-validation and &lt;span style="font-weight: bold;"&gt;n&lt;/span&gt; is the number of ratings.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Resulting Formula for Effect&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So, after all the above analysis, the formula for effect is as follows:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;effect =&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/SBzDpsoPiaI/AAAAAAAAA-E/MFHWZqew_xM/s1600-h/posteriormean7.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/SBzDpsoPiaI/AAAAAAAAA-E/MFHWZqew_xM/s400/posteriormean7.png" alt="" id="BLOGGER_PHOTO_ID_5196243191200713122" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_wqJeUjTB5sE/SBzE8coPibI/AAAAAAAAA-M/Yon_bcAJlzM/s1600-h/posteriormean8.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_wqJeUjTB5sE/SBzE8coPibI/AAAAAAAAA-M/Yon_bcAJlzM/s400/posteriormean8.png" alt="" id="BLOGGER_PHOTO_ID_5196244612834888114" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;The List of Global Effects&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In their paper, BellKor list the following "effects":&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Overall Mean (RMSE = &lt;span style="font-weight: bold;"&gt;1.1296&lt;/span&gt;)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Movie Effect (RMSE = &lt;span style="font-weight: bold;"&gt;1.0527&lt;/span&gt;)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;User Effect (RMSE = &lt;span style="font-weight: bold;"&gt;0.9841&lt;/span&gt;)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;User x Time (user)&lt;sup&gt;1/2&lt;/sup&gt; (RMSE = &lt;span style="font-weight: bold;"&gt;0.9809&lt;/span&gt;)&lt;/li&gt;&lt;li&gt;User x Time (movie)&lt;sup&gt;1/2&lt;/sup&gt; (RMSE = &lt;span style="font-weight: bold;"&gt;0.9786&lt;/span&gt;)&lt;/li&gt;&lt;li&gt;Movie x Time (movie)&lt;sup&gt;1/2&lt;/sup&gt; (RMSE = &lt;span style="font-weight: bold;"&gt;0.9767&lt;/span&gt;)&lt;/li&gt;&lt;li&gt;Movie x Time (user)&lt;sup&gt;1/2&lt;/sup&gt; (RMSE = &lt;span style="font-weight: bold;"&gt;0.9759&lt;/span&gt;)&lt;/li&gt;&lt;li&gt;User x Movie average (RMSE = &lt;span style="font-weight: bold;"&gt;0.9719&lt;/span&gt;)&lt;/li&gt;&lt;li&gt;User x Movie support (RMSE = &lt;span style="font-weight: bold;"&gt;0.9690&lt;/span&gt;)&lt;/li&gt;&lt;li&gt;Movie x User average (RMSE = &lt;span style="font-weight: bold;"&gt;0.9670&lt;/span&gt;)&lt;/li&gt;&lt;li&gt;Movie x User support (RMSE = &lt;span style="font-weight: bold;"&gt;0.9657&lt;/span&gt;)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;First, I go into more detail about cross validation and the resulting &lt;span style="font-weight: bold;"&gt;α&lt;/span&gt;.  Then, I will now proceed to show my effort to reproduce each effect using sql.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Cross-Validation Values&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Yehuda Koren of BellKor has provided the &lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=772"&gt;alpha values&lt;/a&gt; that they used:&lt;br /&gt;&lt;blockquote&gt;Movie: &lt;span style="font-weight: bold;"&gt;25, 1.0527&lt;/span&gt;&lt;br /&gt;User: &lt;span style="font-weight: bold;"&gt;7, 0.9841&lt;/span&gt;&lt;br /&gt;User-time(user): &lt;span style="font-weight: bold;"&gt;550, 0.9809&lt;/span&gt;&lt;br /&gt;User-time(movie): &lt;span style="font-weight: bold;"&gt;150, 0.9786&lt;/span&gt;&lt;br /&gt;Movie-time(movie): &lt;span style="font-weight: bold;"&gt;4000, 0.9767&lt;/span&gt;&lt;br /&gt;Movie-time(user): &lt;span style="font-weight: bold;"&gt;500, 0.9759&lt;/span&gt;&lt;br /&gt;User-mean(movie): &lt;span style="font-weight: bold;"&gt;90, 0.9719&lt;/span&gt;&lt;br /&gt;User-support(movie): &lt;span style="font-weight: bold;"&gt;90, 0.9690&lt;/span&gt;&lt;br /&gt;Movie-mean(user): &lt;span style="font-weight: bold;"&gt;50, 0.9670&lt;/span&gt;&lt;br /&gt;Movie-support(user): &lt;span style="font-weight: bold;"&gt;50, 0.9658&lt;br /&gt;&lt;/span&gt;&lt;/blockquote&gt;In general, cross-validation is done simply by changing values.  For example, if you use user_effect and set &lt;span style="font-weight: bold;"&gt;α = 6&lt;/span&gt; and then &lt;span style="font-weight: bold;"&gt;α = 8&lt;/span&gt;, you will see the following pattern:&lt;br /&gt;&lt;br /&gt;User effect: α = &lt;span style="font-weight: bold;"&gt;6&lt;/span&gt;, RMSE = &lt;span style="font-weight: bold;"&gt;0.984506&lt;/span&gt;&lt;br /&gt;User effect: α = &lt;span style="font-weight: bold;"&gt;7&lt;/span&gt;, RMSE =&lt;span style="font-weight: bold;"&gt;0.984436&lt;/span&gt;&lt;br /&gt;User effect: α = &lt;span style="font-weight: bold;"&gt;8&lt;/span&gt;, RMSE = &lt;span style="font-weight: bold;"&gt;0.984434&lt;/span&gt;&lt;br /&gt;User effect: α = &lt;span style="font-weight: bold;"&gt;9&lt;/span&gt;, RMSE = &lt;span style="font-weight: bold;"&gt;0.984485&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;br /&gt;Overall Mean&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formulas for the over all mean are:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCkJcKJ5ISI/AAAAAAAABAE/FVgX9JWxpmk/s1600-h/global1.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCkJcKJ5ISI/AAAAAAAABAE/FVgX9JWxpmk/s400/global1.png" alt="" id="BLOGGER_PHOTO_ID_5199697624142782754" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Using sql, we have:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;select sum(rating)/count(rating) from rating;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;After less than a minute, the result is &lt;span style="font-weight: bold;"&gt;3.6033&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;To test out rmse for the overall mean, I run:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;update probe set prediction = 3.6033;&lt;br /&gt;&lt;br /&gt;create view rmse_probe as select sqrt(sum((rating - prediction)*(rating - prediction))/count(rating)) "rmse" from probe;&lt;br /&gt;&lt;br /&gt;select * from rmse_probe;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;From this select, I get &lt;span style="font-weight: bold;"&gt;RMSE = 1.1296&lt;/span&gt;.  That's a match to the BellKor results.&lt;br /&gt;&lt;br /&gt;This then gives us:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(1) = r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; - [overall  mean]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;update rating set residual = rating - 3.6033;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This took almost 9 minutes on my machine.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Movie Effect&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCkLeKJ5IXI/AAAAAAAABAs/8ItM5TdLSVU/s1600-h/global2.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCkLeKJ5IXI/AAAAAAAABAs/8ItM5TdLSVU/s400/global2.png" alt="" id="BLOGGER_PHOTO_ID_5199699857525776754" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_wqJeUjTB5sE/SCkLmqJ5IYI/AAAAAAAABA0/djeiVqRsu88/s1600-h/global3.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_wqJeUjTB5sE/SCkLmqJ5IYI/AAAAAAAABA0/djeiVqRsu88/s400/global3.png" alt="" id="BLOGGER_PHOTO_ID_5199700003554664834" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;movie effect = θ&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;i&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;*x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = θ&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;i&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;*1 = θ&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;i&lt;/sub&gt;&lt;br /&gt;&lt;br /&gt;The SQL for this is:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table movie_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);&lt;br /&gt;&lt;br /&gt;insert into movie_effect (movieid, sum_r_ui_x_ui, sum_x_ui_squared, n_i) select movieid, sum(residual), count(residual), count(residual) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;update movie_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update movie_effect set theta_i = (n_i*theta_i_hat)/(n_i + 25);&lt;br /&gt;&lt;br /&gt;create index index_movie_effect_movieid on movie_effect(movieid);&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;On my machine, it completed in roughly 3 minutes.&lt;br /&gt;&lt;br /&gt;To test it on probe, you do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;alter table movie_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_i from movie_effect where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE that I get here is: &lt;span style="font-weight: bold;"&gt;1.0527.  &lt;/span&gt;That's a match to the BellKor results.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(2) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(1)&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; - [movie effect]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_i from movie_effect where movieid=r.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt; &lt;span style="font-weight: bold;"&gt;User Effect&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SCkbHaJ5IZI/AAAAAAAABA8/80TP2MU7tCo/s1600-h/global4.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SCkbHaJ5IZI/AAAAAAAABA8/80TP2MU7tCo/s400/global4.png" alt="" id="BLOGGER_PHOTO_ID_5199717058869797266" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCkbQKJ5IaI/AAAAAAAABBE/WQD3OVpMvHM/s1600-h/global5.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCkbQKJ5IaI/AAAAAAAABBE/WQD3OVpMvHM/s400/global5.png" alt="" id="BLOGGER_PHOTO_ID_5199717209193652642" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;user effect = θ&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;u&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;*x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = θ&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;u&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;*1 = θ&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;u&lt;/sub&gt;&lt;br /&gt;&lt;br /&gt;Here's the SQL (note: on my system, the user_effect table is too big for memory, so I created a special table memory_user_effect):&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);&lt;br /&gt;&lt;br /&gt;insert into user_effect (userid, sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid, sum(residual), count(residual),count(residual) from rating group by userid;&lt;br /&gt;&lt;br /&gt;update user_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update user_effect set theta_u = (n_u*theta_u_hat)/(n_u + 7);&lt;br /&gt;&lt;br /&gt;create index index_user_effect_userid on user_effect(userid);&lt;br /&gt;&lt;br /&gt;set max_heap_table_size=32000000;&lt;br /&gt;&lt;br /&gt;alter table user_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_u from user_effect where userid=p.userid);&lt;br /&gt;&lt;br /&gt;update probe set prediction = 5 where prediction &amp;gt; 5;&lt;br /&gt;&lt;br /&gt;update probe set prediction = 1 where prediction &amp;lt; 1;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE I get here is:  &lt;span style="font-weight: bold;"&gt;0.98404.  &lt;/span&gt;That's a match to the BellKor results.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(3) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(2) - [user effect]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_u from user_effect where userid=r.userid);&lt;br /&gt;&lt;br /&gt;alter table user_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;User x Time (user)&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;1/2&lt;/sup&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = √&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; - avg(√&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SCpuyKJ5IlI/AAAAAAAABCc/RZ8Yp4oIghs/s1600-h/globalfix1.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SCpuyKJ5IlI/AAAAAAAABCc/RZ8Yp4oIghs/s400/globalfix1.png" alt="" id="BLOGGER_PHOTO_ID_5200090527751021138" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SClGbKJ5IcI/AAAAAAAABBU/GXzAcTcZaNo/s1600-h/global7.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SClGbKJ5IcI/AAAAAAAABBU/GXzAcTcZaNo/s400/global7.png" alt="" id="BLOGGER_PHOTO_ID_5199764677172208066" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;user x time (user) effect =  θ&lt;sub&gt;u&lt;/sub&gt;*&lt;/span&gt;[&lt;span style="font-weight: bold;"&gt;√&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; - avg(√&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;)&lt;/span&gt;]&lt;br /&gt;&lt;br /&gt;where &lt;span style="font-weight: bold;"&gt;√&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; = sqrt(numOfDaysBetween(date,min(date) &lt;/span&gt;for user))&lt;br /&gt;&lt;br /&gt;Here's what I did for the sql:&lt;br /&gt;&lt;br /&gt;(1)  Populate the "first" date field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_first_rating as select userid,min(date) "first" from rating group by userid;&lt;br /&gt;&lt;br /&gt;create index index_user_first_rating_userid on user_first_rating(userid);&lt;br /&gt;&lt;br /&gt;alter table user_first_rating engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set first = (select first from user_first_rating where userid = r.userid);&lt;br /&gt;&lt;br /&gt;update probe p set first = (select first from user_first_rating where userid = p.userid);&lt;br /&gt;&lt;br /&gt;update probe p set first = date where datediff(date,first) &amp;lt; 0;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(2)  Populate "avg" field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_avg_num_days_for_user (userid int(3), avg double);&lt;br /&gt;&lt;br /&gt;insert into user_avg_num_days_for_user (userid,avg) select userid,avg(sqrt(datediff(date,first))) from rating group by userid;&lt;br /&gt;&lt;br /&gt;create index index_user_avg_num_days_for_user_userid on user_avg_num_days_for_user(userid);&lt;br /&gt;&lt;br /&gt;alter table user_avg_num_days_for_user engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set avg = (select avg from user_avg_num_days_for_user where userid = r.userid);&lt;br /&gt;&lt;br /&gt;update probe p set avg = (select avg from user_avg_num_days_for_user where userid = p.userid);&lt;br /&gt;&lt;br /&gt;alter table user_avg_num_days_for_user engine=myisam;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(3)  Compute user_time_user_effect&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_time_user_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);&lt;br /&gt;&lt;br /&gt;insert into user_time_user_effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(residual) from rating group by userid;&lt;br /&gt;&lt;br /&gt;update user_time_user_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update user_time_user_effect set theta_u = (n_u*theta_u_hat)/(n_u + 550);&lt;br /&gt;&lt;br /&gt;update user_time_user_effect set theta_u = 0 where theta_u is null;&lt;br /&gt;&lt;br /&gt;create index index_user_time_user_effect on user_time_user_effect(userid);&lt;br /&gt;&lt;br /&gt;alter table user_time_user_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_user_effect where userid=p.userid);&lt;br /&gt;&lt;br /&gt;update probe set prediction = 5 where prediction &amp;gt; 5;&lt;br /&gt;&lt;br /&gt;update probe set prediction = 1 where prediction &amp;lt; 1;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE I get here is: &lt;span style="font-weight: bold;"&gt;0.9809.  &lt;/span&gt;That's a match to the BellKor results.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(4) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(3) - [user x time(user) effect]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_user_effect where userid=r.userid);&lt;br /&gt;&lt;br /&gt;alter table user_time_user_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;User x Time (movie)&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;1/2&lt;/sup&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;x&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt; = √&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; - avg(√&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_wqJeUjTB5sE/SCpvIaJ5ImI/AAAAAAAABCk/g3xo2pbG9-4/s1600-h/globalfix2.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_wqJeUjTB5sE/SCpvIaJ5ImI/AAAAAAAABCk/g3xo2pbG9-4/s400/globalfix2.png" alt="" id="BLOGGER_PHOTO_ID_5200090910003110498" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_wqJeUjTB5sE/SCpwWaJ5IpI/AAAAAAAABC8/pcFcrE-zRb8/s1600-h/globalfix5.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_wqJeUjTB5sE/SCpwWaJ5IpI/AAAAAAAABC8/pcFcrE-zRb8/s400/globalfix5.png" alt="" id="BLOGGER_PHOTO_ID_5200092250032906898" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;user x time (user) effect =  θ&lt;sub&gt;u&lt;/sub&gt;*&lt;/span&gt;[&lt;span style="font-weight: bold;"&gt;√&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; - avg(√&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;)&lt;/span&gt;]&lt;br /&gt;&lt;br /&gt;where &lt;span style="font-weight: bold;"&gt;√&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; = sqrt(numOfDaysBetween(date,min(date)&lt;/span&gt; for movie))&lt;br /&gt;&lt;br /&gt;Here's the sql:&lt;br /&gt;&lt;br /&gt;(1)  Populate the "first" date field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table movie_first_rating as select movieid,min(date) "first" from rating group by movieid;&lt;br /&gt;&lt;br /&gt;create index index_movie_first_rating_movieid on movie_first_rating(movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_first_rating engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set first = (select first from movie_first_rating where movieid = r.movieid);&lt;br /&gt;&lt;br /&gt;update probe p set first = (select first from movie_first_rating where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_first_rating engine=myisam;&lt;br /&gt;&lt;br /&gt;update probe p set first = date where datediff(date,first) &amp;lt; 0;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(2)  Populate the "avg" field of rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_avg_num_days_for_movie (userid int(3), avg double);&lt;br /&gt;&lt;br /&gt;insert into user_avg_num_days_for_movie (userid,avg) select userid,avg(sqrt(datediff(date,first))) from rating group by userid;&lt;br /&gt;&lt;br /&gt;create index index_user_avg_num_days_for_movie on user_avg_num_days_for_movie(userid);&lt;br /&gt;&lt;br /&gt;alter table user_avg_num_days_for_movie engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set avg = (select avg from user_avg_num_days_for_movie where userid = r.userid);&lt;br /&gt;&lt;br /&gt;update probe p set avg = (select avg from user_avg_num_days_for_movie where userid=p.userid);&lt;br /&gt;&lt;br /&gt;alter table user_avg_num_days_for_movie engine=myisam;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(3)  Compute user_time_movie_effect&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_time_movie_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);&lt;br /&gt;&lt;br /&gt;insert into user_time_movie_effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(residual) from rating group by userid;&lt;br /&gt;&lt;br /&gt;update user_time_movie_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update user_time_movie_effect set theta_u = (n_u*theta_u_hat)/(n_u + 150);&lt;br /&gt;&lt;br /&gt;update user_time_movie_effect set theta_u = 0 where theta_u is null;&lt;br /&gt;&lt;br /&gt;create index index_user_time_movie_effect on user_time_movie_effect(userid);&lt;br /&gt;&lt;br /&gt;alter table user_time_movie_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_movie_effect where userid=p.userid);&lt;br /&gt;&lt;br /&gt;update probe set prediction = 5 where prediction &amp;gt; 5;&lt;br /&gt;&lt;br /&gt;update probe set prediction = 1 where prediction &amp;lt; 1;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE I get here is:  &lt;span style="font-weight: bold;"&gt;0.9785&lt;/span&gt;.    That's a match to the BellKor results.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(5) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(4) - [user x time(movie) effect]&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_u*(sqrt(datediff(date,first)) - avg) from user_time_movie_effect where userid=r.userid);&lt;br /&gt;&lt;br /&gt;alter table user_time_movie_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Movie x Time (movie)&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;1/2&lt;/sup&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SCpviKJ5InI/AAAAAAAABCs/5kELQEHslAg/s1600-h/globalfix3.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SCpviKJ5InI/AAAAAAAABCs/5kELQEHslAg/s400/globalfix3.png" alt="" id="BLOGGER_PHOTO_ID_5200091352384742002" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/SCpprqJ5IhI/AAAAAAAABB8/bXimefNaBbo/s1600-h/global11.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/SCpprqJ5IhI/AAAAAAAABB8/bXimefNaBbo/s400/global11.png" alt="" id="BLOGGER_PHOTO_ID_5200084918523732498" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where &lt;span style="font-weight: bold;"&gt;√&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; = sqrt(numOfDaysBetween(date,min(date)&lt;/span&gt; for &lt;span style="font-weight: bold;"&gt;movie))&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here's the sql:&lt;br /&gt;&lt;br /&gt;(1)  Populate the "first" date field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;alter table movie_first_rating engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set first = (select first from movie_first_rating where movieid = r.movieid);&lt;br /&gt;&lt;br /&gt;update probe p set first = (select first from movie_first_rating where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;update probe set first = date where datediff(date,first) &amp;lt; 0;&lt;br /&gt;&lt;br /&gt;alter table movie_first_rating engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(2)  Populate the "avg" field of rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table movie_avg_num_days_for_movie (movieid int(2), avg double);&lt;br /&gt;&lt;br /&gt;insert into movie_avg_num_days_for_movie (movieid,avg) select movieid,avg(sqrt(datediff(date,first))) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;create index index_movie_avg_num_days_for_movie on movie_avg_num_days_for_movie(movieid);&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;alter table movie_avg_num_days_for_movie engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set avg = (select avg from movie_avg_num_days_for_movie where movieid = r.movieid);&lt;br /&gt;&lt;br /&gt;update probe p set avg = (select avg from movie_avg_num_days_for_movie where movieid=p.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_avg_num_days_for_movie engine=myisam;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(3)  Compute movie_time_movie_effect&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table movie_time_movie_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);&lt;br /&gt;&lt;br /&gt;insert into movie_time_movie_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(residual) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;update movie_time_movie_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update movie_time_movie_effect set theta_i = (n_i*theta_i_hat)/(n_i + 4000);&lt;br /&gt;&lt;br /&gt;update movie_time_movie_effect set theta_i = 0 where theta_i is null;&lt;br /&gt;&lt;br /&gt;create index index_movie_time_movie_effect on movie_time_movie_effect(movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_time_movie_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_movie_effect where movieid=p.movieid);&lt;br /&gt;&lt;br /&gt;update probe set prediction = 5 where prediction &amp;gt; 5;&lt;br /&gt;&lt;br /&gt;update probe set prediction = 1 where prediction &amp;lt; 1;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE I get here is:   &lt;span style="font-weight: bold;"&gt;0.9766&lt;/span&gt;.  That's a match to the BellKor results.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(6) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(5) - [movie x time(movie) effect]&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_movie_effect where movieid=r.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_time_movie_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Movie x Time (user)&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;1/2&lt;/sup&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCpv56J5IoI/AAAAAAAABC0/RuxO7P6ydiY/s1600-h/globalfix4.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCpv56J5IoI/AAAAAAAABC0/RuxO7P6ydiY/s400/globalfix4.png" alt="" id="BLOGGER_PHOTO_ID_5200091760406635138" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/SCpqvqJ5IjI/AAAAAAAABCM/9KHEIE6Hyl4/s1600-h/global13.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/SCpqvqJ5IjI/AAAAAAAABCM/9KHEIE6Hyl4/s400/global13.png" alt="" id="BLOGGER_PHOTO_ID_5200086086754837042" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where&lt;span style="font-weight: bold;"&gt; √&lt;/span&gt;&lt;span style="text-decoration: overline; font-weight: bold;"&gt;days&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; = sqrt(numOfDaysBetween(date,min(date)&lt;/span&gt; for &lt;span style="font-weight: bold;"&gt;user))&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here's the sql:&lt;br /&gt;&lt;br /&gt;(1)  Populate the "first" date field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;alter table user_first_rating engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set first = (select first from user_first_rating where userid = r.userid);&lt;br /&gt;&lt;br /&gt;update probe p set first = (select first from user_first_rating where userid = p.userid);&lt;br /&gt;&lt;br /&gt;update probe set first = date where datediff(date,first) &amp;lt; 0;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(2)  Populate "avg" field of tables: rating and probe.&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table movie_avg_num_days_for_user (movieid int(2), avg double);&lt;br /&gt;&lt;br /&gt;insert into movie_avg_num_days_for_user (movieid,avg) select movieid,avg(sqrt(datediff(date,first))) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;create index index_movie_avg_num_days_for_user_movieid on movie_avg_num_days_for_user(movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_avg_num_days_for_user engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set avg = (select avg from movie_avg_num_days_for_user where movieid = r.movieid);&lt;br /&gt;&lt;br /&gt;update probe p set avg = (select avg from movie_avg_num_days_for_user where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_avg_num_days_for_user engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(3)  Compute movie_time_user_effect&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table movie_time_user_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);&lt;br /&gt;&lt;br /&gt;insert into movie_time_user_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(sqrt(datediff(date,first)) - avg)),sum((sqrt(datediff(date,first)) - avg)*(sqrt(datediff(date,first))-avg)),count(rating) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;update movie_time_user_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update movie_time_user_effect set theta_i = (n_i*theta_i_hat)/(n_i + 500);&lt;br /&gt;&lt;br /&gt;create index index_movie_time_user_effect_movieid on movie_time_user_effect(movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_time_user_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_user_effect where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;update probe set prediction = 5 where prediction &amp;gt; 5;&lt;br /&gt;&lt;br /&gt;update probe set prediction = 1 where prediction &amp;lt; 1;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE I get here is:  &lt;span style="font-weight: bold;"&gt;0.9758.&lt;/span&gt;  That's a match to the BellKor results.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(7) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(6) - [movie x time(user) effect]&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_i*(sqrt(datediff(date,first)) - avg) from movie_time_user_effect where movieid=r.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_time_user_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;User x Movie average&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SCpw9KJ5IqI/AAAAAAAABDE/9hm17PH9xDw/s1600-h/global14.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SCpw9KJ5IqI/AAAAAAAABDE/9hm17PH9xDw/s400/global14.png" alt="" id="BLOGGER_PHOTO_ID_5200092915752837794" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCpxU6J5IrI/AAAAAAAABDM/e0eyG0PltjM/s1600-h/global15.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCpxU6J5IrI/AAAAAAAABDM/e0eyG0PltjM/s400/global15.png" alt="" id="BLOGGER_PHOTO_ID_5200093323774730930" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Here's the sql:&lt;br /&gt;&lt;br /&gt;(1)  Populate the "avg" field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table movie_average (movieid int(2), avg double);&lt;br /&gt;&lt;br /&gt;insert into movie_average (movieid,avg) select movieid,avg(rating) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;create index index_movie_average_movieid on movie_average(movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_average engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set avg = (select avg from movie_average where movieid = r.movieid);&lt;br /&gt;&lt;br /&gt;update probe p set avg = (select avg from movie_average where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_average engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(2)  Compute user_movie_average_effect&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_movie_average_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);&lt;br /&gt;&lt;br /&gt;insert into user_movie_average_effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(avg - 3.6033)),sum((avg - 3.6033)*(avg - 3.6033)),count(rating) from rating group by userid;&lt;br /&gt;&lt;br /&gt;update user_movie_average_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update user_movie_average_effect set theta_u = (n_u*theta_u_hat)/(n_u + 550);&lt;br /&gt;&lt;br /&gt;update user_movie_average_effect set theta_u = 0 where theta_u is null;&lt;br /&gt;&lt;br /&gt;create index user_movie_average_effect_userid on user_movie_average_effect(userid);&lt;br /&gt;&lt;br /&gt;alter table user_movie_average_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_u*(avg - 3.6033) from user_movie_average_effect where userid=p.userid);&lt;br /&gt;&lt;br /&gt;update probe set prediction = 5 where prediction &amp;gt; 5;&lt;br /&gt;&lt;br /&gt;update probe set prediction = 1 where prediction &amp;lt; 1;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE I get here is:  &lt;span style="font-weight: bold;"&gt;0.9720.&lt;/span&gt;  That's not an exact match but pretty close to the BellKor reported result of &lt;span style="font-weight: bold;"&gt;RMSE = 0.9719&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(8) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(7) -  [user x  movie  average  effect] &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_u*(avg - 3.6033) from user_movie_average_effect where userid=r.userid);&lt;br /&gt;&lt;br /&gt;alter table user_movie_average_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;User x Movie support&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SD-qVBdgctI/AAAAAAAABE8/aSgTcrfeO5s/s1600-h/replace4.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SD-qVBdgctI/AAAAAAAABE8/aSgTcrfeO5s/s400/replace4.png" alt="" id="BLOGGER_PHOTO_ID_5206066972409754322" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCpxU6J5IrI/AAAAAAAABDM/e0eyG0PltjM/s1600-h/global15.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/SCpxU6J5IrI/AAAAAAAABDM/e0eyG0PltjM/s400/global15.png" alt="" id="BLOGGER_PHOTO_ID_5200093323774730930" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where support = the number of ratings for a given movie.&lt;br /&gt;&lt;br /&gt;Here's the sql:&lt;br /&gt;&lt;br /&gt;(1)  Populate the "support"  field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table movie_support (movieid int(2), support int(3));&lt;br /&gt;&lt;br /&gt;insert into movie_support(movieid,support) select movieid,count(rating) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;create index index_movie_support_movieid on movie_support(movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_support engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set support = (select support from movie_support where movieid = r.movieid);&lt;br /&gt;&lt;br /&gt;update probe p set support = (select support from movie_support where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_support engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(2)  Populate avg field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;create table average_movie_support_for_user (userid int(3), avg double);&lt;br /&gt;&lt;br /&gt;insert into average_movie_support_for_user(userid,avg) select userid,avg(sqrt(support)) from rating group by userid;&lt;br /&gt;&lt;br /&gt;create index index_average_movie_support_for_user_userid on average_movie_support_for_user(userid);&lt;br /&gt;&lt;br /&gt;alter table average_movie_support_for_user engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set avg = (select avg from average_movie_support_for_user where userid = r.userid);&lt;br /&gt;&lt;br /&gt;update probe p set avg = (select avg from average_movie_support_for_user where userid = p.userid);&lt;br /&gt;&lt;br /&gt;alter table average_movie_support_for_user engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(3)  Compute user_movie_support_effect&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_movie_support_effect (userid int(3), sum_r_ui_x_ui double, sum_x_ui_squared double, n_u int(3), theta_u_hat double, theta_u double);&lt;br /&gt;&lt;br /&gt;insert into user_movie_support_ effect (userid,sum_r_ui_x_ui,sum_x_ui_squared,n_u) select userid,sum(residual*(sqrt(support) - avg),sum((sqrt(support) - avg)*(sqrt(support) - avg)),count(rating) from rating group by userid;&lt;br /&gt;&lt;br /&gt;update user_movie_support_effect set theta_u_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update user_movie_support_effect set theta_u = (n_u*theta_u_hat)/(n_u + 90);&lt;br /&gt;&lt;br /&gt;update user_movie_support_effect set theta_u = 0 where theta_u is null;&lt;br /&gt;&lt;br /&gt;create index index_user_movie_support_effect_userid on user_movie_support_effect(userid);&lt;br /&gt;&lt;br /&gt;alter table user_movie_support_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_u*(sqrt(support) - avg) from user_movie_support_effect where userid = p.userid);&lt;br /&gt;&lt;br /&gt;update probe set prediction = 5 where prediction &amp;gt; 5;&lt;br /&gt;&lt;br /&gt;update probe set prediction = 1 where prediction &amp;lt; 1;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE I get here is:  &lt;span style="font-weight: bold;"&gt;0.9691.&lt;/span&gt;  Still, extremely close to the BellKor reported result of &lt;span style="font-weight: bold;"&gt;RMSE = 0.9690&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(9) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(8) - [user x movie support effect]&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_u*(sqrt(support) - avg) from user_movie_support_effect where userid=r.userid);&lt;br /&gt;&lt;br /&gt;alter table user_movie_support_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Movie x User average &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/SDz9yxdgcrI/AAAAAAAABEs/MzxtOk9Efo4/s1600-h/replace2.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/SDz9yxdgcrI/AAAAAAAABEs/MzxtOk9Efo4/s400/replace2.png" alt="" id="BLOGGER_PHOTO_ID_5205314318045835954" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/SCpzVqJ5IuI/AAAAAAAABDk/m9svV4nw9Bc/s1600-h/global18.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/SCpzVqJ5IuI/AAAAAAAABDk/m9svV4nw9Bc/s400/global18.png" alt="" id="BLOGGER_PHOTO_ID_5200095535682888418" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Here's the sql:&lt;br /&gt;&lt;br /&gt;(1)  Populate the "avg" field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_average (userid int(3), avg double);&lt;br /&gt;&lt;br /&gt;insert into user_average (userid,avg) select userid,avg(rating) from rating group by userid;&lt;br /&gt;&lt;br /&gt;create index index_user_average_userid on user_average(userid);&lt;br /&gt;&lt;br /&gt;alter table user_average engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set avg = (select avg from user_average where userid = r.userid);&lt;br /&gt;&lt;br /&gt;update probe p set avg = (select avg from user_average where userid = p.userid);&lt;br /&gt;&lt;br /&gt;alter table user_average engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(2)  Populate "avg2" field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;alter table movie_average engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set avg2 = (select avg from movie_average where movieid = r.movieid);&lt;br /&gt;&lt;br /&gt;update probe p set avg2 = (select avg from movie_average where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_average engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(3)  Compute movie_user_average_effect&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;create table movie_user_average_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);&lt;br /&gt;&lt;br /&gt;insert into movie_user_average_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(avg - avg2),sum((avg - avg2)*(avg - avg2)),count(rating) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;update movie_user_average_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update movie_user_average_effect set theta_i = (n_i*theta_i_hat)/(n_i + 50);&lt;br /&gt;&lt;br /&gt;create index index_movie_user_average_effect_movieid on movie_user_average_effect(movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_user_average_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_i*(avg - avg2) from memory_movie_user_average_effect where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;update probe set prediction = 5 where prediction &amp;gt; 5;&lt;br /&gt;&lt;br /&gt;update probe set prediction = 1 where prediction &amp;lt; 1;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE I get here is:  &lt;span style="font-weight: bold;"&gt;0.9680.  &lt;/span&gt;While I caught an effect, my result is different than BellKor's reported result of &lt;span style="font-weight: bold;"&gt;RMSE = 0.9670.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(10) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(9) -  [movie x user average effect] &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_i*(avg - avg2) from movie_user_average_effect where movie=r.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_user_average_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Movie x User support&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The formula here is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/SECG2n7TJ_I/AAAAAAAABFE/ch6iCOHQoiQ/s1600-h/replace5.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/SECG2n7TJ_I/AAAAAAAABFE/ch6iCOHQoiQ/s400/replace5.png" alt="" id="BLOGGER_PHOTO_ID_5206309442229053426" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/SCpzVqJ5IuI/AAAAAAAABDk/m9svV4nw9Bc/s1600-h/global18.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/SCpzVqJ5IuI/AAAAAAAABDk/m9svV4nw9Bc/s400/global18.png" alt="" id="BLOGGER_PHOTO_ID_5200095535682888418" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where support = the number rating made by a given user.&lt;br /&gt;&lt;br /&gt;Here is the sql:&lt;br /&gt;&lt;br /&gt;(1)  Populate the "support" field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table user_support (userid int(3), support int(3));&lt;br /&gt;&lt;br /&gt;insert into user_support(userid,support) select userid,count(rating) from rating group by userid;&lt;br /&gt;&lt;br /&gt;create index index_user_support_userid on user_support(userid);&lt;br /&gt;&lt;br /&gt;alter table user_support engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set support = (select support from user_support where userid=r.userid);&lt;br /&gt;&lt;br /&gt;update probe p set support = (select support from user_support where userid=p.userid);&lt;br /&gt;&lt;br /&gt;alter table user_support engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(2)  Populate "avg" field of tables: rating and probe&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table average_user_support_for_movie (movieid int(2), avg double);&lt;br /&gt;&lt;br /&gt;insert into average_user_support_for_movie(movieid,avg) select movieid,avg(support) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;create index index_average_user_support_for_movie_movieid on average_user_support_for_movie(movieid);&lt;br /&gt;&lt;br /&gt;alter table average_user_support_for_movie engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set avg = (select avg from average_user_support_for_movie where movieid = r.movieid);&lt;br /&gt;&lt;br /&gt;update probe p set avg = (select avg from average_user_support_for_movie where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;alter table average_user_support_for_movie engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(3)  Compute movie_user_support_effect&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table movie_user_support_effect (movieid int(2), sum_r_ui_x_ui double, sum_x_ui_squared double, n_i int(3), theta_i_hat double, theta_i double);&lt;br /&gt;&lt;br /&gt;insert into movie_user_support_effect (movieid,sum_r_ui_x_ui,sum_x_ui_squared,n_i) select movieid,sum(residual*(sqrt(support) - sqrt(avg))),sum((sqrt(support) - sqrt(avg))*(sqrt(support) - sqrt(avg))),count(rating) from rating group by movieid;&lt;br /&gt;&lt;br /&gt;update movie_user_support_effect set theta_i_hat = sum_r_ui_x_ui/sum_x_ui_squared;&lt;br /&gt;&lt;br /&gt;update movie_user_support_effect set theta_i = (n_i*theta_i_hat)/(n_i + 50);&lt;br /&gt;&lt;br /&gt;update movie_user_support_effect set theta_i = 0 where theta_i is null;&lt;br /&gt;&lt;br /&gt;create index movie_user_support_effect_movieid on movie_user_support_effect(movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_user_support_effect engine=memory;&lt;br /&gt;&lt;br /&gt;update probe p set prediction = prediction + (select theta_i*(sqrt(support) - sqrt(avg) from movie_user_support_effect where movieid = p.movieid);&lt;br /&gt;&lt;br /&gt;update probe set prediction = 5 where prediction &amp;gt; 5;&lt;br /&gt;&lt;br /&gt;update probe set prediction = 1 where prediction &amp;lt; 1;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The RMSE I get here is:  &lt;span style="font-weight: bold;"&gt;0.9658.  &lt;/span&gt;That got me back on track to the BellKor reported result of&lt;span style="font-weight: bold;"&gt; RMSE = 0.9657.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;&lt;span style="font-weight: bold;"&gt;(11) = &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;r&lt;sub&gt;ui&lt;/sub&gt;(10) - [movie x user support effect]&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To update, the rating table, we do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;&lt;br /&gt;update rating r set residual = residual - (select theta_i*(avg - avg2) from movie_user_average_effect where movie=r.movieid);&lt;br /&gt;&lt;br /&gt;alter table movie_user_average_effect engine=myisam;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Now, the trick is to find any other global effects to see how low you can go.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Observations from Netflix Forum&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There is one particular &lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=847"&gt;thread&lt;/a&gt; on the NetflixPrize forum where the experts have spoken about global effects.&lt;br /&gt;&lt;br /&gt;In this discussion, Gavin Potter (Just a Guy in a Garage ranked #4 at the time of the article) notes that he is able to get to an RMSE of &lt;span style="font-weight: bold;"&gt;0.95&lt;/span&gt; just using global effects. This impressed Yehuda Koren of the BellKor team who writes:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Gavin, I am very impressed by the RMSE=&lt;span style="font-weight: bold;"&gt;0.9503&lt;/span&gt; achieved by pure global effects.&lt;/blockquote&gt;&lt;br /&gt;Yehuda Koren states that he does not believe that Global Effects alone will get someone to the NetFlix Prize.  He writes:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Global effects are neat and intuitive, unlike the latent factor models (SVD and the likes) that by definition seek hidden, less explainable factors.&lt;br /&gt;However, I don’t expect that improving global effects will have much impact on the final RMSE. Latent factor models were proven much more effective at recovering the same information and much beyond. Of course, I might be wrong about this...&lt;/blockquote&gt;In my next article, I will focus on the k-nearest neighbor (knn) algorithm and try to reproduce the BellKor results for global effects + knn.&lt;br /&gt;&lt;br /&gt;Just to see what would happen, I ran the above "global effects" from the qualifying.txt and submitted to Netflix.  The result against the qualifying data was an &lt;span style="font-weight: bold;"&gt;RMSE = 0.9665&lt;/span&gt; which is very close to the &lt;span style="font-weight: bold;"&gt;RMSE = 0.9658&lt;/span&gt; that I received against the probe data.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=847"&gt;"Weights for Global Effects"&lt;/a&gt;, Netflix Prize Forum&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=772"&gt;"Loose Ends"&lt;/a&gt;, Netflix Prize Forum&lt;/li&gt;&lt;li&gt;&lt;a href="http://public.research.att.com/%7Evolinsky/netflix/cfworkshop.pdf"&gt;"Improved Neighborhood-based Collaborative Filtering"&lt;/a&gt;,Robert M. Bell and Yehuda Koren, AT&amp;amp;T Labs – Research&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8592569895405724581-8788475394061075914?l=algorithmsanalyzed.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://algorithmsanalyzed.blogspot.com/feeds/8788475394061075914/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8592569895405724581&amp;postID=8788475394061075914' title='31 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/8788475394061075914'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/8788475394061075914'/><link rel='alternate' type='text/html' href='http://algorithmsanalyzed.blogspot.com/2008/05/bellkor-algorithm-global-effects.html' title='BellKor Algorithm: Global Effects'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_wqJeUjTB5sE/SBywB8oPiTI/AAAAAAAAA9M/0QlsiDyERic/s72-c/unbiasedestimator.png' height='72' width='72'/><thr:total>31</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8592569895405724581.post-5458851225717417701</id><published>2008-03-06T21:52:00.000-08:00</published><updated>2008-12-10T01:47:13.810-08:00</updated><title type='text'>BellKor Algorithm: Neighborhood Based Approach</title><content type='html'>The content that follows is taken from Robert M. Bell and Yehuda Koren's article &lt;a href="http://public.research.att.com/%7Eyehuda/pubs/cf-workshop.pdf"&gt;"Improved Neighborhood-based Collaborative Filtering."&lt;/a&gt;  This article is cited as the reference for the Neighborhood-based approach used in the &lt;a href="http://algorithmsanalyzed.blogspot.com/2008/03/bellkor-algorithm-overview.html"&gt;BellKor algorithm&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The Neighborhood-based algorithm as described by the BellKor team has three major steps:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;1)  data normalization&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In this step, an attempt is made to control for "global effects", that is, effects where certain users tend to rate movies higher or lower or it can also include interaction effects such that the probability that the Fellowship of the Rings, Two Towers, Return of the King will probably have similar ratings.  Movies may get higher ratings when they first come out or lower ratings when they've been out a very long time.  Some movies are have a large number of ratings.  Some movies have a much smaller number.&lt;br /&gt;&lt;br /&gt;For the probe data, predicting movie ratings based solely on the mean results in an &lt;span style="font-weight: bold;"&gt;RMSE&lt;/span&gt; of &lt;span style="font-weight: bold;"&gt;1.1296&lt;/span&gt; (see my &lt;a href="http://algorithmsanalyzed.blogspot.com/2008/02/netflix-prize-1-million-challenge.html"&gt;earlier blog&lt;/a&gt; for review of &lt;span style="font-weight: bold;"&gt;RMSE&lt;/span&gt; if needed).  Controlling for "global effects" gave this same date an &lt;span style="font-weight: bold;"&gt;RMSE of 0.9657&lt;/span&gt; (according to the cited article).&lt;br /&gt;&lt;br /&gt;In the actual &lt;span style="font-weight: bold;"&gt;kNN&lt;/span&gt; algorithm, residual data is processed.  Using the algorithm without removing "global effects" resulted in an &lt;span style="font-weight: bold;"&gt;RMSE&lt;/span&gt; of &lt;span style="font-weight: bold;"&gt;0.9536&lt;/span&gt; (according to the cited article).  Using &lt;span style="font-weight: bold;"&gt;kNN&lt;/span&gt; after removing "global effects" and matrix factorization data resulted in an &lt;span style="font-weight: bold;"&gt;RMSE&lt;/span&gt; of &lt;span style="font-weight: bold;"&gt;0.9071&lt;/span&gt; (according to the cited article).&lt;br /&gt;&lt;br /&gt;I will not cover the removal of "global effects" or matrix factorization in today's blog.  For purposes of focusing solely on the &lt;span style="font-weight: bold;"&gt;kNN&lt;/span&gt; algorithm, I wil assume that all data is raw (that is, it has not been normalized).&lt;br /&gt;&lt;br /&gt;In a future blog, I will cover "global effects" and matrix factorization and try to reproduce the BellKor results cited.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2)  neighbor selection&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is where a similarity function is selected and where this similiarity function is used to determine the &lt;span style="font-weight: bold;"&gt;k&lt;/span&gt; nearest neighbors.  This can be approached in a user-oriented way (a function on similiar users) or in an item-oriented way (a function on simlar items).&lt;br /&gt;&lt;br /&gt;I will go into detail about the formulas used in the article for the similarity function.  In a future blog, I will get into code and compare the BellKor approach to other kNN approaches.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;3)  determination of the interpolation weights&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Determination of interpolation weights is very important since they are tied very closed to the prediction.  Bell and Koren compare results they reached with results from a "correlation-based interpolation."&lt;br /&gt;&lt;br /&gt;They found that for raw scores, correlation-based interpolation resulted in an &lt;span style="font-weight: bold;"&gt;RMSE&lt;/span&gt; of &lt;span style="font-weight: bold;"&gt;0.9947&lt;/span&gt;.  In contrast, their approach resulted in an &lt;span style="font-weight: bold;"&gt;RMSE&lt;/span&gt; of &lt;span style="font-weight: bold;"&gt;0.9536&lt;/span&gt;.  After removing "global effects" and matrix factorization results, the correlated-based approach resulted in &lt;span style="font-weight: bold;"&gt;0.9142&lt;/span&gt; in comparison to team BellKor's result of &lt;span style="font-weight: bold;"&gt;0.9071&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Below I will detail the formulas for BellKor approach of "jointly derived interpolation."&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Disclaimer&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I would also like to make a clear disclaimer.  I am not an expert in learning theory nor the ideas presented in the article.  This summary is my best effort to understand the formulas in the article.  I would encourage everyone to read the original articles.  If you are an expert in learning theory, please review this article and post your comments.  I am very glad to be corrected on any point which I do not accurately represent.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Neighborhood-based Approach&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Neighborhood-based approach attempts to predict how a user &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt; will rate an item &lt;span style="font-weight: bold;"&gt;i &lt;/span&gt;by looking at previous data.  The &lt;span style="font-style: italic;"&gt;user-oriented&lt;/span&gt; approach looks for users similar to &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt; that have already rated item &lt;span style="font-weight: bold;"&gt;i&lt;/span&gt;.  The &lt;span style="font-style: italic;"&gt;item-oriented approach&lt;/span&gt; looks for items similar to &lt;span style="font-weight: bold;"&gt;i&lt;/span&gt; that user &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt; has already rated.  In both cases, the strategy is to find a set of most similar items and then use this set to make the prediction.&lt;br /&gt;&lt;br /&gt;One of the first questions that need to be answered is how many elements need to be in such a set?  Because of the importance of this question, the neighborhood-based approach is commonly called the "&lt;span style="font-weight: bold;"&gt;k&lt;/span&gt; Nearest Neighbors" (&lt;span style="font-weight: bold;"&gt;kNN&lt;/span&gt;) approach.  In the BellKor article, they considered values of &lt;span style="font-weight: bold;"&gt;k=20&lt;/span&gt;, &lt;span style="font-weight: bold;"&gt;k=35&lt;/span&gt;, and &lt;span style="font-weight: bold;"&gt;k=50&lt;/span&gt;.  For the raw data, they found the best result with &lt;span style="font-weight: bold;"&gt;k=20&lt;/span&gt; (&lt;span style="font-weight: bold;"&gt;RMSE = 0.9536&lt;/span&gt;).  For the normalized data, they found the same result for all three (&lt;span style="font-weight: bold;"&gt;RMSE=0.9071&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Notation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As far as notation, the article refers to users with the variables &lt;span style="font-weight: bold;"&gt;u,v&lt;/span&gt; and refers to items with the variables &lt;span style="font-weight: bold;"&gt;i,j,k&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The set of &lt;span style="font-weight: bold;"&gt;k&lt;/span&gt; Nearest Neighbors is represented as &lt;span style="font-weight: bold;"&gt;N(u;i)&lt;/span&gt; in the case of similar users (the user-oriented approach) or &lt;span style="font-weight: bold;"&gt;N(i;u)&lt;/span&gt; in the case of similar items (the item-oriented approach).&lt;br /&gt;&lt;br /&gt;Rating by user &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt; of an item &lt;span style="font-weight: bold;"&gt;i&lt;/span&gt; i s represented as &lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ui&lt;/sub&gt;.&lt;br /&gt;&lt;br /&gt;Selection of &lt;span style="font-weight: bold;"&gt;N(u;i)&lt;/span&gt; or &lt;span style="font-weight: bold;"&gt;N(i;u)&lt;/span&gt; is based on a similarity function &lt;span style="font-weight: bold;"&gt;s&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ij&lt;/sub&gt; when item &lt;span style="font-weight: bold;"&gt;i&lt;/span&gt; is compared to item &lt;span style="font-weight: bold;"&gt;j&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;s&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;uv&lt;/sub&gt; if user &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt; is compared to user &lt;span style="font-weight: bold;"&gt;v&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;U(i,j)&lt;/span&gt; represents the set of users who have rated both item &lt;span style="font-weight: bold;"&gt;i&lt;/span&gt; and item &lt;span style="font-weight: bold;"&gt;j&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;|U(i,j)|&lt;/span&gt; represents the number of users rating both &lt;span style="font-weight: bold;"&gt;i&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;j&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;I(u,v)&lt;/span&gt; represents the set of items that are rated by both user &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt; and user &lt;span style="font-weight: bold;"&gt;v&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;|I(u,v)|&lt;/span&gt; represents the number of items that are rated by both user &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt; and user &lt;span style="font-weight: bold;"&gt;v&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Similarity Function and Neighbor Selection&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To build &lt;span style="font-weight: bold;"&gt;N(i;u)&lt;/span&gt;, we need a similarity function to determine the similarity of another item &lt;span style="font-weight: bold;"&gt;j&lt;/span&gt; to the given item &lt;span style="font-weight: bold;"&gt;i&lt;/span&gt;.  In other words, we need&lt;span style="font-weight: bold;"&gt; s&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ij&lt;/sub&gt;.&lt;br /&gt;&lt;br /&gt;The formula for &lt;span style="font-weight: bold;"&gt;s&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ij&lt;/sub&gt; is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_wqJeUjTB5sE/R9Df8bbX8CI/AAAAAAAAA6s/Hmyx8kW_N54/s1600-h/item1.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_wqJeUjTB5sE/R9Df8bbX8CI/AAAAAAAAA6s/Hmyx8kW_N54/s400/item1.png" alt="" id="BLOGGER_PHOTO_ID_5174882201096417314" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where &lt;span style="font-weight: bold;"&gt;α&lt;/span&gt; is a small value that we can play around with.&lt;br /&gt;&lt;br /&gt;To build &lt;span style="font-weight: bold;"&gt;N(u;i)&lt;/span&gt;, we need a similarity function to determine the similarity of another user &lt;span style="font-weight: bold;"&gt;v&lt;/span&gt; to the  given user to &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt;.  In other words, we need &lt;span style="font-weight: bold;"&gt;s&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;uv&lt;/sub&gt;.&lt;br /&gt;&lt;br /&gt;The formula for &lt;span style="font-weight: bold;"&gt;s&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;uv&lt;/sub&gt; is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/R9DhtrbX8DI/AAAAAAAAA60/MtdFIRuWSfE/s1600-h/item2.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/R9DhtrbX8DI/AAAAAAAAA60/MtdFIRuWSfE/s400/item2.png" alt="" id="BLOGGER_PHOTO_ID_5174884146716602418" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where &lt;span style="font-weight: bold;"&gt;α&lt;/span&gt; is a small value that we can play around with.&lt;br /&gt;&lt;br /&gt;So, we cnow build the set &lt;span style="font-weight: bold;"&gt;N(u;i)&lt;/span&gt; which consists of the &lt;span style="font-weight: bold;"&gt;k&lt;/span&gt;-most similar users that have rated item&lt;span style="font-weight: bold;"&gt; i&lt;/span&gt; (based on maximizing&lt;span style="font-weight: bold;"&gt; s&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;uv&lt;/sub&gt;) and we can build the set &lt;span style="font-weight: bold;"&gt;N(i;u)&lt;/span&gt; which consists of the &lt;span style="font-weight: bold;"&gt;k&lt;/span&gt;-most similar items that have been rated by user &lt;span style="font-weight: bold;"&gt;u&lt;/span&gt; (based on maximizing &lt;span style="font-weight: bold;"&gt;s&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ij&lt;/sub&gt;).&lt;br /&gt;&lt;br /&gt;In their article, I am unclear on how &lt;span style="font-weight: bold;"&gt;α&lt;/span&gt; is determined.  They say "for some small &lt;span style="font-weight: bold;"&gt;α&lt;/span&gt;".&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Determining Interpolation Weights&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Next, we need to determine interpolation weights for the prediction.&lt;br /&gt;&lt;br /&gt;If using a user-oriented approach, then each weight is &lt;span style="font-weight: bold;"&gt;w&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;uv&lt;/sub&gt;.&lt;br /&gt;&lt;br /&gt;If using an item-oriented approach, then each weight is &lt;span style="font-weight: bold;"&gt;w&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;ij&lt;/sub&gt;.&lt;br /&gt;&lt;br /&gt;The goal is ultimately to use one of the following formulas for a prediction:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_wqJeUjTB5sE/R9L3PHCRIpI/AAAAAAAAA8k/WHmmLTW9ZRE/s1600-h/fix1.png"&gt;&lt;img style="cursor: pointer;" src="http://2.bp.blogspot.com/_wqJeUjTB5sE/R9L3PHCRIpI/AAAAAAAAA8k/WHmmLTW9ZRE/s400/fix1.png" alt="" id="BLOGGER_PHOTO_ID_5175470760760779410" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;or&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_wqJeUjTB5sE/R9L3bXCRIqI/AAAAAAAAA8s/PH6hUulauyo/s1600-h/fix2.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_wqJeUjTB5sE/R9L3bXCRIqI/AAAAAAAAA8s/PH6hUulauyo/s400/fix2.png" alt="" id="BLOGGER_PHOTO_ID_5175470971214176930" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;If the data is dense enough, then the interpolation can be determined as a least squares problem:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Aw = b&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;which gives us:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;w = A&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;-1&lt;/sup&gt;&lt;span style="font-weight: bold;"&gt;b&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;where &lt;span style="font-weight: bold;"&gt;A&lt;/span&gt; is an invertible &lt;span style="font-weight: bold;"&gt;K x K&lt;/span&gt; matrix defined as:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/R9IAg3CRIeI/AAAAAAAAA7M/LzkaVcdzxqg/s1600-h/item5.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/R9IAg3CRIeI/AAAAAAAAA7M/LzkaVcdzxqg/s400/item5.png" alt="" id="BLOGGER_PHOTO_ID_5175199486331396578" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;and &lt;span style="font-weight: bold;"&gt;b ∈ R&lt;/span&gt;&lt;sup style="font-weight: bold;"&gt;K&lt;/sup&gt;  where:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_wqJeUjTB5sE/R9IBUXCRIfI/AAAAAAAAA7U/m7q2tekNIXI/s1600-h/item6.png"&gt;&lt;img style="cursor: pointer;" src="http://3.bp.blogspot.com/_wqJeUjTB5sE/R9IBUXCRIfI/AAAAAAAAA7U/m7q2tekNIXI/s400/item6.png" alt="" id="BLOGGER_PHOTO_ID_5175200371094659570" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Since, it is unreasonable to assume that the data is dense enough, one improvement would be to average each sum over its total number of members.  In this case:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/R9IDt3CRIgI/AAAAAAAAA7c/Ni_D7u2CNHc/s1600-h/item7.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/R9IDt3CRIgI/AAAAAAAAA7c/Ni_D7u2CNHc/s400/item7.png" alt="" id="BLOGGER_PHOTO_ID_5175203008204579330" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;and:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/R9IG9nCRIlI/AAAAAAAAA8E/XMOlm3QzMyI/s1600-h/item8.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/R9IG9nCRIlI/AAAAAAAAA8E/XMOlm3QzMyI/s400/item8.png" alt="" id="BLOGGER_PHOTO_ID_5175206577322402386" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where &lt;span style="font-weight: bold;"&gt;U(j,k)&lt;/span&gt; is the set of users who rated both item &lt;span style="font-weight: bold;"&gt;j &lt;/span&gt;and item &lt;span style="font-weight: bold;"&gt;k&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;U(i,j)&lt;/span&gt; is the set of users hwo rated both item &lt;span style="font-weight: bold;"&gt;i &lt;/span&gt;and item &lt;span style="font-weight: bold;"&gt;j&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;But this can be quite inaccurate for very small sets of &lt;span style="font-weight: bold;"&gt;U(i,j)&lt;/span&gt;.  To remedy this, the article recommends using a baseline value which they &lt;span style="font-weight: bold;"&gt;avg&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;We will create two estimates of the above values:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/R9JaG3CRImI/AAAAAAAAA8M/xtdJPyOa7VU/s1600-h/item11.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/R9JaG3CRImI/AAAAAAAAA8M/xtdJPyOa7VU/s400/item11.png" alt="" id="BLOGGER_PHOTO_ID_5175297995701297762" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;and&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/R9JaTnCRInI/AAAAAAAAA8U/oXaDmZNL0kg/s1600-h/item12.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/R9JaTnCRInI/AAAAAAAAA8U/oXaDmZNL0kg/s400/item12.png" alt="" id="BLOGGER_PHOTO_ID_5175298214744629874" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In the above estimates, "the parameter &lt;span style="font-weight: bold;"&gt;β&lt;/span&gt; controls the extent of the shrinkage. "  For example, in the article, they talk about their reasons for using a &lt;span style="font-weight: bold;"&gt;β = 500&lt;/span&gt; when used with normalized data.&lt;br /&gt;&lt;br /&gt;The value of &lt;span style="font-weight: bold;"&gt;avg&lt;/span&gt; is computed based on the following matrix:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_wqJeUjTB5sE/R9JcInCRIoI/AAAAAAAAA8c/TPfLaWvrqGM/s1600-h/item13.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_wqJeUjTB5sE/R9JcInCRIoI/AAAAAAAAA8c/TPfLaWvrqGM/s400/item13.png" alt="" id="BLOGGER_PHOTO_ID_5175300224789324418" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;where if &lt;span style="font-weight: bold;"&gt;jk&lt;/span&gt; above is a diagonal entry, then &lt;span style="font-weight: bold;"&gt;avg =&lt;/span&gt; the average of diagonal entries of &lt;span style="text-decoration: overline; font-weight: bold;"&gt;A&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;and if &lt;span style="font-weight: bold;"&gt;jk&lt;/span&gt; above is a nondiagonal entry, then &lt;span style="font-weight: bold;"&gt;avg =&lt;/span&gt; the average of nondiagonal entries of &lt;span style="text-decoration: overline; font-weight: bold;"&gt;A&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This then completes the formulas as presented in their article on the Neighborhood Based Approach.  In my next blog, I will look into code for implementing the kNN approach and see if I am able to reproduce the RMSE scores reported by the BellKor team.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;R. Bell and Y. Koren, &lt;a href="http://public.research.att.com/%7Eyehuda/pubs/cf-workshop.pdf"&gt;"Improved Neighborhood-bsaed Collaborative Filtering"&lt;/a&gt;, KDD-Cup and Workshop, ACM Press, 2007&lt;/li&gt;&lt;li&gt;Robert M. Bell, Yehuda Koren, and Chris Volinsky, &lt;a href="http://www.netflixprize.com/assets/ProgressPrize2007_KorBell.pdf"&gt;"The BellKor solution to the Netflix Prize"&lt;/a&gt; &lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8592569895405724581-5458851225717417701?l=algorithmsanalyzed.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://algorithmsanalyzed.blogspot.com/feeds/5458851225717417701/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8592569895405724581&amp;postID=5458851225717417701' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/5458851225717417701'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/5458851225717417701'/><link rel='alternate' type='text/html' href='http://algorithmsanalyzed.blogspot.com/2008/03/bellkor-algorithm-neighborhood-based_06.html' title='BellKor Algorithm: Neighborhood Based Approach'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_wqJeUjTB5sE/R9Df8bbX8CI/AAAAAAAAA6s/Hmyx8kW_N54/s72-c/item1.png' height='72' width='72'/><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8592569895405724581.post-354292940582203929</id><published>2008-03-02T22:03:00.000-08:00</published><updated>2008-03-08T17:55:07.269-08:00</updated><title type='text'>BellKor Algorithm: Overview</title><content type='html'>In their article "The Bellkor solution to the Netflix Prize," the authors state that their algorithm is the blending of "107 individual results."&lt;br /&gt;&lt;br /&gt;In the solution outlined, the article presupposes detailed understanding of each of the approaches as well as the linear regression analysis used to blend the results.&lt;br /&gt;&lt;br /&gt;They list out the 107 results and categorize the results into the following categories:&lt;br /&gt;&lt;br /&gt;1.  Asymmetric factor models  (9 results)&lt;br /&gt;2.  Regression models (15 results)&lt;br /&gt;3.  Restricted Boltzmann Machines with Gaussian visible units (8 results)&lt;br /&gt;4.  Restricted Boltzmann Machines (8 results)&lt;br /&gt;5.  Matrix factorization (30 results)&lt;br /&gt;6.  &lt;a href="http://algorithmsanalyzed.blogspot.com/2008/03/bellkor-algorithm-neighborhood-based_06.html"&gt;Neighborhood-based model (k-NN)  &lt;/a&gt;(18 results)&lt;br /&gt;7.  Combinations (9 results)&lt;br /&gt;8.  Imputation of Qualifying predictions (7 results)&lt;br /&gt;9.  Specials (3 results)&lt;br /&gt;&lt;br /&gt;Blending is accomplished by viewing the combination of results "as a linear regression problem".&lt;br /&gt;&lt;br /&gt;In retrospect, the authors do not believe that all 107 results are necessary.  They write:&lt;br /&gt;&lt;blockquote&gt;For completeness, we listed all 107 results that were blended in our RMSE=0.8712 submission.  It is important to note that the major reason for using all these results was convenience, as we anyway accumulated them during the year.  This is the nature of an ongoing competition.  However, in hindsight, we would probably drop many of these results and recompute some in a different way.  We believe that far fewer results are really necessary.&lt;br /&gt;&lt;br /&gt;&lt;/blockquote&gt;They then list out some of the factors out of the 107 that have the most impact:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A blend of 3 results leads to an RMSE of 0.8793&lt;/li&gt;&lt;li&gt;A blend of 11 results leads to an 8% improvement over the Cinematch score.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Their recommended approach to the Netflix Challenge is the following:&lt;br /&gt;&lt;blockquote&gt;Predictive accuracy is substantially improved when blending mutliple predictors.&lt;span style="font-style: italic;"&gt;  Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique.&lt;/span&gt; (italics in original)&lt;br /&gt;&lt;br /&gt;&lt;/blockquote&gt;In my next blog, I will go into more detail on the Neighborhood-based model (k-NN).&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;References&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;R. Bell, Y. Koren, C. Volinsky (2007) &lt;a href="http://www.netflixprize.com/assets/ProgressPrize2007_KorBell.pdf"&gt;"The Bellkor Solution to the Netflix Prize"&lt;/a&gt;, Netflix.com&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8592569895405724581-354292940582203929?l=algorithmsanalyzed.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://algorithmsanalyzed.blogspot.com/feeds/354292940582203929/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8592569895405724581&amp;postID=354292940582203929' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/354292940582203929'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/354292940582203929'/><link rel='alternate' type='text/html' href='http://algorithmsanalyzed.blogspot.com/2008/03/bellkor-algorithm-overview.html' title='BellKor Algorithm: Overview'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8592569895405724581.post-8132366740184675653</id><published>2008-02-28T08:05:00.000-08:00</published><updated>2008-02-28T08:12:55.933-08:00</updated><title type='text'>Netflix Prize: Wired Ariticle just published</title><content type='html'>On February 25, &lt;a href="http://www.wired.com/techbiz/media/magazine/16-03/mf_netflix"&gt;an article was published&lt;/a&gt; in Wired Magazine about the contestants of the Netflix Prize.  It focuses on the BellKor team which won the &lt;a href="http://algorithmsanalyzed.blogspot.com/2008/02/50000-algorithm.html"&gt;Progress Prize&lt;/a&gt; and a psychologist, Gavin Potter, is apparently doing very well under the team name "Just a Guy in the Garage".&lt;br /&gt;&lt;br /&gt;In my next blog, I will give the high level view of the BellKor algorithm.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Jordan Ellenberg, &lt;a href="http://www.wired.com/techbiz/media/magazine/16-03/mf_netflix"&gt;"This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize"&lt;/a&gt;, Wired, Feb 25, 2008.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8592569895405724581-8132366740184675653?l=algorithmsanalyzed.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://algorithmsanalyzed.blogspot.com/feeds/8132366740184675653/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8592569895405724581&amp;postID=8132366740184675653' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/8132366740184675653'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/8132366740184675653'/><link rel='alternate' type='text/html' href='http://algorithmsanalyzed.blogspot.com/2008/02/netflix-prize-wired-ariticle-just.html' title='Netflix Prize: Wired Ariticle just published'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8592569895405724581.post-5337348521007379074</id><published>2008-02-27T16:27:00.001-08:00</published><updated>2008-12-10T01:47:14.398-08:00</updated><title type='text'>Netflix Prize: The $1 Million Challenge</title><content type='html'>Participation in the NetFlix Challenge is simple.  First, a user registers a team at &lt;a href="http://netflixprize.com/"&gt;NetFlixPrize.com&lt;/a&gt;.  A team can consist of one or more persons.  You go &lt;a href="http://www.netflixprize.com//teams"&gt;here&lt;/a&gt; to register.&lt;br /&gt;&lt;br /&gt;Once registered, Netflix sends you the following data files:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;a movie data set of over &lt;span style="font-weight: bold;"&gt;17000&lt;/span&gt; movies with three fields: &lt;span style="font-weight: bold;"&gt;(title, year, movie id)&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;a training data set that includes over &lt;span style="font-weight: bold;"&gt;100 million&lt;/span&gt; ratings of &lt;span style="font-weight: bold;"&gt;480,000&lt;/span&gt; distinct users with four fields: &lt;span style="font-weight: bold;"&gt;(user id, movie id, rating, date)&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;a qualifying data set of &lt;span style="font-weight: bold;"&gt;2.8 million&lt;/span&gt; entries with three fields:  &lt;span style="font-weight: bold;"&gt;(user id, movie id, date)&lt;/span&gt; but no rating.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;All data sets are presented as flat files.  In the case of the training data set, there is a distinct flat file given for each movie.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Movie id&lt;/span&gt; consists of a number between &lt;span style="font-weight: bold;"&gt;1&lt;/span&gt; and the number of movies (over &lt;span style="font-weight: bold;"&gt;17000&lt;/span&gt;).  User id consists of a number between &lt;span style="font-weight: bold;"&gt;1&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;3 million&lt;/span&gt; (over &lt;span style="font-weight: bold;"&gt;480,000&lt;/span&gt; distinct user ids).  Rating is an integer between &lt;span style="font-weight: bold;"&gt;1&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;5&lt;/span&gt; where &lt;span style="font-weight: bold;"&gt;1&lt;/span&gt; is the lowest rating (worst movie) and &lt;span style="font-weight: bold;"&gt;5&lt;/span&gt; is the highest rating (best movie).&lt;br /&gt;&lt;br /&gt;The task is to generate ratings for the &lt;span style="font-weight: bold;"&gt;2.8 million&lt;/span&gt; entries of the qualifying data set based on the training data files.  The generated ratings do not have to be integers so you can estimate &lt;span style="font-weight: bold;"&gt;1.1, 0.5, 2.5, 5.5,&lt;/span&gt; etc for each record in the qualifying data set.&lt;br /&gt;&lt;br /&gt;Each submission of generated ratings for the qualifying data set is then scored against the actual ratings (which are kept secret by NetFlix) using the &lt;a href="http://en.wikipedia.org/wiki/Root_mean_square_deviation"&gt;root mean squared error&lt;/a&gt; (RMSE).  If &lt;span style="font-weight: bold;"&gt;p&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;i&lt;/sub&gt; is the prediction and &lt;span style="font-weight: bold;"&gt;r&lt;/span&gt;&lt;sub style="font-weight: bold;"&gt;i&lt;/sub&gt; is the actual rating, then the RMSE score for a submitted qualifying data set is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_wqJeUjTB5sE/R8YHGB_sShI/AAAAAAAAA6E/JYwLtlhkMnk/s1600-h/rmse.png"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_wqJeUjTB5sE/R8YHGB_sShI/AAAAAAAAA6E/JYwLtlhkMnk/s400/rmse.png" alt="" id="BLOGGER_PHOTO_ID_5171829022277782034" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;NetFlix provides a tool written in perl for calculating the RMSE.  Of course, it is pretty easy to write your own tool since the above formula is equivalent to:&lt;br /&gt;&lt;listing&gt;&lt;br /&gt;double calcRMSE(double[] p, double[] r) {&lt;br /&gt;   &lt;br /&gt;    int size = p.length;&lt;br /&gt;    double sum = 0;&lt;br /&gt;   &lt;br /&gt;    for (int i=0; i &lt; size; i++) {&lt;br /&gt;         double diff = p[i] - r[i];&lt;br /&gt;         sum += diff*diff;           &lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    return Math.sqrt(sum/size);&lt;br /&gt; &lt;br /&gt;}&lt;br /&gt;&lt;/listing&gt;&lt;br /&gt;Each team can make a submission of these predicted ratings once every 24 hours.  Later, they will receive their RMSE score from NetFlix.&lt;br /&gt;&lt;br /&gt;The top scores are posted on the web site on the &lt;a href="http://www.netflixprize.com//leaderboard"&gt;Leaderboard&lt;/a&gt;.  Interestingly, many of the teams on the leaderboard have hyperlinks to their web pages so if you are interested, you can browse their papers and their home pages.&lt;br /&gt;&lt;br /&gt;To win the $1 million, a team needs to have a score lower than &lt;span style="font-weight: bold;"&gt;0.8572&lt;/span&gt;.  If more than one team has a score lower than &lt;span style="font-weight: bold;"&gt;0.8572&lt;/span&gt;, then the team with the lowest score will win.  If after one year, no team has a score lower than &lt;span style="font-weight: bold;"&gt;0.8572&lt;/span&gt;, then the top score will win a progress prize of $50,000.&lt;br /&gt;&lt;br /&gt;To claim any money, a team must submit their algorithm within one week of notification of their score.  This algorithm must be wholly original or at least not involve any license requirements from third-party software sources.  The full details for this contest can be found &lt;a href="http://www.netflixprize.com//rules"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;At the time that I am writing this blog, the top RMSE score for a qualifying data set is: &lt;span style="font-weight: bold;"&gt;0.8684&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;So far, only one progress prize has been awarded and this was awarded to a team with the score of &lt;span style="font-weight: bold;"&gt;0.8712&lt;/span&gt;.  This team is called Korbell and here's their &lt;a href="http://www.research.att.com/%7Evolinsky/netflix/"&gt;home page&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In my next blog, I will start reviewing the details of the algorithm that got to &lt;span style="font-weight: bold;"&gt;0.8712&lt;/span&gt;.  The team that produced that score is also the team that currently has the lowest score at &lt;span style="font-weight: bold;"&gt;0.8684&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References &lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://netflixprize.com/"&gt;NetFlix Prize Home Page&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Netflix_prize"&gt;"NetFlix Prize"&lt;/a&gt;, Wikipedia.com&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Root_mean_square_deviation"&gt;"RMSE"&lt;/a&gt;, Wikipedia.com&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8592569895405724581-5337348521007379074?l=algorithmsanalyzed.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://algorithmsanalyzed.blogspot.com/feeds/5337348521007379074/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8592569895405724581&amp;postID=5337348521007379074' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/5337348521007379074'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/5337348521007379074'/><link rel='alternate' type='text/html' href='http://algorithmsanalyzed.blogspot.com/2008/02/netflix-prize-1-million-challenge.html' title='Netflix Prize: The $1 Million Challenge'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_wqJeUjTB5sE/R8YHGB_sShI/AAAAAAAAA6E/JYwLtlhkMnk/s72-c/rmse.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8592569895405724581.post-7351861607087254236</id><published>2008-02-22T14:14:00.000-08:00</published><updated>2008-02-22T14:27:17.557-08:00</updated><title type='text'>The $50,000 algorithm</title><content type='html'>Before jumping into the great work by Donald Knuth, I thought it would be interesting to start with analysis of a contemporary algorithm which has already given its authors $50,000.&lt;br /&gt;&lt;br /&gt;NetFlix currently has a &lt;a href="http://www.netflixprize.com/index"&gt;$1 million contest&lt;/a&gt; to anyone who can write an algorithm for predicting user ratings that beats their algorithm by at least 10%.  This content has run since 2006 and so far, no one has done it.&lt;br /&gt;&lt;br /&gt;Each year, if no one has claimed the top prize, NetFlix awards a $50,000 progress prize to the best algorithm to date.  The first such prize was given this year.  In order to claim the prize, the individual or team that wrote the algorithm must publish its details.&lt;br /&gt;&lt;br /&gt;This year's prize went to Robert M. Bell, Yehuda Koren, and Chris Volinsky, members of the AT&amp;amp;T Research Labs.  Considering that there are over 9,000 teams competing for the prize, AT&amp;amp;T can be very proud of their achievement.&lt;br /&gt;&lt;br /&gt;As a starter to this blog, I thought it would be interesting to go in detail through the algorithm that won the 2007 Progress Prize.&lt;br /&gt;&lt;br /&gt;References&lt;br /&gt;&lt;ul&gt;&lt;li&gt;NetFlix Prize, &lt;a href="http://www.netflixprize.com/index"&gt;Home Page&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Netflix_Prize"&gt;"Netflix Prize"&lt;/a&gt;, Wikipedia&lt;/li&gt;&lt;li&gt;Robert M. Bell, Yehuda Koren, Chris Volinsky, &lt;a href="http://www.netflixprize.com/assets/ProgressPrize2007_KorBell.pdf"&gt;"The BellKor solution to the NetFlix Prize"&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8592569895405724581-7351861607087254236?l=algorithmsanalyzed.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://algorithmsanalyzed.blogspot.com/feeds/7351861607087254236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8592569895405724581&amp;postID=7351861607087254236' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/7351861607087254236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/7351861607087254236'/><link rel='alternate' type='text/html' href='http://algorithmsanalyzed.blogspot.com/2008/02/50000-algorithm.html' title='The $50,000 algorithm'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8592569895405724581.post-4780417287981276449</id><published>2008-02-20T09:12:00.000-08:00</published><updated>2008-02-27T23:22:39.546-08:00</updated><title type='text'>Purpose of this blog</title><content type='html'>Last year, I was fortunate enough to hear the Mark Zuckerberg, the CEO of Facebook, speak at a &lt;a href="http://startupschool.org/"&gt;conference&lt;/a&gt; that was organized by &lt;a href="http://www.paulgraham.com/bio.html"&gt;Paul Graham&lt;/a&gt;.  One of the many points that he made was that for every position in his company, he wanted to hire a software engineer.  The reasoning for this is in the internet marketplace, nimbleness is everything.&lt;br /&gt;&lt;br /&gt;Mark explained that if everyone is a software engineer, then customer service can be more responsive, marketing will understand the technology being promoted, and product will understand deeply what is possible and what is not.&lt;br /&gt;&lt;br /&gt;But what does this really mean?  What is a software engineer?&lt;br /&gt;&lt;br /&gt;Once upon a time, a software engineer was someone who coded.  It was someone who you gave programming problems to, they created code, and they showed what they could do.  Sometimes, there were math problems.  All that has changed.  Now, everything comes down to expert knowledge of algorithms and data structures.  If you don't speak fluent O-notation, you may have trouble getting your next job at the technology companies in the forefront.&lt;br /&gt;&lt;br /&gt;From my view, this blog is really an excuse to dive deep into algorithm and data structure analysis.  I had previously started a math blog which to my surprise has been very well received.&lt;br /&gt;&lt;br /&gt;I think that it makes sense for me to start a new blog focused on the mathematics of software engineering.  Software engineering is rapidly changing.  I don't know if in the technology company, everyone will be a software engineer.  I do know that as time goes on, the professionalism and algorithmic expertise required for the engineering job will be raised to higher and higher levels.&lt;br /&gt;&lt;br /&gt;As I did in my &lt;a href="http://fermatslasttheorem.blogspot.com/"&gt;math blog&lt;/a&gt;, I will use this blog mostly as a reference for the classic works in software engineering.  I will cover selections from the following works:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Donald Knuth, &lt;span style="font-style: italic;"&gt;&lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;amp;location=http%3A%2F%2Fwww.amazon.com%2FArt-Computer-Programming-Volumes-Boxed%2Fdp%2F0201485419%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1203758870%26sr%3D8-1&amp;amp;tag=themovieadvis-20&amp;amp;linkCode=ur2&amp;amp;camp=1789&amp;amp;creative=9325"&gt;The Art of Computer Programming&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=themovieadvis-20&amp;amp;l=ur2&amp;amp;o=1" alt="" style="border: medium none  ! important; margin: 0px ! important;" border="0" height="1" width="1" /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Robert Sedgewick, &lt;span style="font-style: italic;"&gt;&lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;amp;location=http%3A%2F%2Fwww.amazon.com%2FBundle-Algorithms-Java-Third-Parts%2Fdp%2F0201775786%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1203759033%26sr%3D1-3&amp;amp;tag=themovieadvis-20&amp;amp;linkCode=ur2&amp;amp;camp=1789&amp;amp;creative=9325"&gt;Algorithms&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=themovieadvis-20&amp;amp;l=ur2&amp;amp;o=1" alt="" style="border: medium none  ! important; margin: 0px ! important;" border="0" height="1" width="1" /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Thomas H. Cormen et al, &lt;span style="font-style: italic;"&gt;&lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;amp;location=http%3A%2F%2Fwww.amazon.com%2FIntroduction-Algorithms-Thomas-H-Cormen%2Fdp%2F0262032937%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1203759297%26sr%3D1-2&amp;amp;tag=themovieadvis-20&amp;amp;linkCode=ur2&amp;amp;camp=1789&amp;amp;creative=9325"&gt;Algorithms&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=themovieadvis-20&amp;amp;l=ur2&amp;amp;o=1" alt="" style="border: medium none  ! important; margin: 0px ! important;" border="0" height="1" width="1" /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Eric Gamma et al, &lt;span style="font-style: italic;"&gt;&lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;amp;location=http%3A%2F%2Fwww.amazon.com%2FDesign-Patterns-Object-Oriented-Addison-Wesley-Professional%2Fdp%2F0201633612%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1203759381%26sr%3D1-1&amp;amp;tag=themovieadvis-20&amp;amp;linkCode=ur2&amp;amp;camp=1789&amp;amp;creative=9325"&gt;Design Patterns&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=themovieadvis-20&amp;amp;l=ur2&amp;amp;o=1" alt="" style="border: medium none  ! important; margin: 0px ! important;" border="0" height="1" width="1" /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;For more recent and interesting progress in computer science, I will also cover the following works:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Diomidis Spinellis, &lt;span style="font-style: italic;"&gt;&lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;location=http%3A%2F%2Fwww.amazon.com%2FCode-Reading-Perspective-Effective-Development%2Fdp%2F0201799405%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1204183187%26sr%3D8-1&amp;tag=themovieadvis-20&amp;linkCode=ur2&amp;camp=1789&amp;creative=9325"&gt;Code Reading&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=themovieadvis-20&amp;amp;l=ur2&amp;amp;o=1" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Amy N. Langville et al, &lt;span style="font-style: italic;"&gt;&lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;location=http%3A%2F%2Fwww.amazon.com%2FGoogles-PageRank-Beyond-Science-Rankings%2Fdp%2F0691122024%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1204183270%26sr%3D1-1&amp;tag=themovieadvis-20&amp;linkCode=ur2&amp;camp=1789&amp;creative=9325"&gt;Google PageRank and Beyond&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=themovieadvis-20&amp;amp;l=ur2&amp;amp;o=1" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;In between, I hope to cover important snapshots from the current work of technology including:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Javascript/Ajax&lt;/li&gt;&lt;li&gt;Ruby/Rails&lt;/li&gt;&lt;li&gt;Python&lt;/li&gt;&lt;li&gt;Java/Spring/Open Source Projects&lt;/li&gt;&lt;li&gt;C#/.Net&lt;/li&gt;&lt;/ul&gt;At all times, feel free to post your comments and questions.&lt;br /&gt;&lt;br /&gt;-Larry&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8592569895405724581-4780417287981276449?l=algorithmsanalyzed.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://algorithmsanalyzed.blogspot.com/feeds/4780417287981276449/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8592569895405724581&amp;postID=4780417287981276449' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/4780417287981276449'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8592569895405724581/posts/default/4780417287981276449'/><link rel='alternate' type='text/html' href='http://algorithmsanalyzed.blogspot.com/2008/02/purpose-of-this-blog.html' title='Purpose of this blog'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><thr:total>1</thr:total></entry></feed>
