Okay, so we’re not going to be matching the million-dollar prize-winning algorithm that Netflix is based on, but I will be showing you how to code your own movie recommendation engine from the ground up. This means we will take some data on how people rate films and try to find preference relationships between these people. Based on the strength of these relationships, we can then apply some weighted scores to work out what films individual people may well like.

You can get the full source code for this project on Bitbucket.

## Collaborative filtering in movie recommender engines

The system we’ll be building is based on collaborative filtering. Collaborative filtering is something you’ve been doing yourself “manually” your whole life. In the case of what we’re looking at, films, there are certain people you’ve observed to have a similar taste in films to you. You may take their opinions more seriously than those of somebody you know you have very different tastes from when it comes to recommending what film to watch next. A collaborative filtering based recommendation engine allows you to do this at a much larger scale, with an accuracy that was not previously possible.

## Collecting the data

Before we begin, we need a dataset to work with. Rather than copying and pasting my data, it can be more fun to make a survey with Google Docs and gather data from family and friends; the results will be more interesting!

I picked 20 films from the IMDB Top 250 rated films and asked my colleagues via a Google form to rate these films on a scale of 1 (hated it) to 10 (loved it), with instructions just to leave blank any film they hadn’t seen.

If you were doing this on a large scale, it would make sense to use a database, but since we’re only working with a tiny amount of data, we’ll just use nested Python dictionaries.

#### Create recommendations.py and add your critic data

Feel free to use the data below: a dictionary called critics that maps each critic’s name to a dictionary of the films they rated and their rating of each film.

```python
critics = {
    'Mark': {'Forrest Gump': 7, 'The Matrix': 9, 'Saving Private Ryan': 9,
             'Gladiator': 8, 'The Lion King': 6, 'Into The Wild': 9,
             'Finding Nemo': 6, 'Sin City': 4, 'Jaws': 7, 'Groundhog Day': 7,
             'Fight Club': 10, 'The Silence of the Lambs': 7, 'Leon': 7,
             'Terminator 2': 8, 'Aliens': 9, 'Braveheart': 6,
             'Full Metal Jacket': 6, 'The Big Lebowski': 3, 'Men in Black': 6},
    'Tom': {'Forrest Gump': 7, 'The Matrix': 5, 'Saving Private Ryan': 8,
            'Gladiator': 6, 'The Lion King': 7, 'Finding Nemo': 8,
            'Sin City': 3, 'Jaws': 4, 'Groundhog Day': 6, 'Fight Club': 10,
            'The Silence of the Lambs': 4, 'Leon': 4, 'Terminator 2': 2,
            'Aliens': 5, 'Braveheart': 3, 'Full Metal Jacket': 8,
            'The Big Lebowski': 5, 'Men in Black': 4},
    'Katy': {'Gladiator': 6, 'The Lion King': 7, 'Into The Wild': 10,
             'Finding Nemo': 7, 'Sin City': 3, 'Fight Club': 6,
             'The Silence of the Lambs': 4, 'Terminator 2': 6, 'Aliens': 7,
             'Guardians of the Galaxy': 6},
    'Seb': {'Forrest Gump': 7, 'The Matrix': 5, 'Saving Private Ryan': 9,
            'The Lion King': 5, 'Fight Club': 8, 'Terminator 2': 9,
            'Aliens': 8, 'The Big Lebowski': 8, 'Men in Black': 3},
    'Richard': {'Forrest Gump': 8, 'The Matrix': 9, 'Saving Private Ryan': 6,
                'The Lion King': 6, 'Finding Nemo': 5, 'Sin City': 5,
                'Jaws': 6, 'Fight Club': 7, 'The Silence of the Lambs': 7,
                'Terminator 2': 9, 'Aliens': 8, 'Full Metal Jacket': 9,
                'Men in Black': 6},
    'Sally': {'Forrest Gump': 7, 'The Matrix': 9, 'Saving Private Ryan': 10,
              'Gladiator': 9, 'Jaws': 5, 'Groundhog Day': 6, 'Fight Club': 7,
              'The Silence of the Lambs': 9, 'Terminator 2': 7, 'Aliens': 8,
              'Braveheart': 9, 'Full Metal Jacket': 8, 'The Big Lebowski': 7,
              'Men in Black': 4},
    'Victoria': {'Forrest Gump': 9, 'The Lion King': 7, 'Jaws': 5,
                 'Fight Club': 7},
    'Emily': {'Forrest Gump': 9, 'The Matrix': 7, 'Finding Nemo': 7,
              'Jaws': 7, 'Groundhog Day': 5, 'Fight Club': 9,
              'The Silence of the Lambs': 7, 'Leon': 6, 'Aliens': 6,
              'Men in Black': 7},
    'Jamie': {'Forrest Gump': 7, 'The Matrix': 6, 'The Lion King': 5,
              'Finding Nemo': 8, 'Jaws': 7, 'Fight Club': 9,
              'Terminator 2': 7, 'Men in Black': 8},
    'Ollie': {'Forrest Gump': 7, 'The Matrix': 7, 'Saving Private Ryan': 8,
              'Gladiator': 7, 'Into The Wild': 6, 'Finding Nemo': 7,
              'Sin City': 7, 'Jaws': 8, 'Groundhog Day': 7, 'Fight Club': 8,
              'The Silence of the Lambs': 7, 'Leon': 7, 'Terminator 2': 5,
              'Aliens': 8, 'Braveheart': 6, 'Full Metal Jacket': 8,
              'Men in Black': 6},
    'Matt': {'The Matrix': 7, 'Finding Nemo': 9, 'Groundhog Day': 6,
             'Terminator 2': 9, 'Guardians of the Galaxy': 9,
             'Men in Black': 6},
    'Shaun': {'Forrest Gump': 6, 'The Matrix': 8, 'Saving Private Ryan': 9,
              'The Lion King': 7, 'Finding Nemo': 7, 'Jaws': 5,
              'Groundhog Day': 7, 'Fight Club': 9,
              'The Silence of the Lambs': 6, 'Terminator 2': 5,
              'Braveheart': 7, 'The Big Lebowski': 3,
              'Guardians of the Galaxy': 6, 'Men in Black': 5},
    'Owen': {'Forrest Gump': 4, 'The Matrix': 4, 'Saving Private Ryan': 5,
             'The Lion King': 5, 'Into The Wild': 6, 'Finding Nemo': 5,
             'Sin City': 7, 'Groundhog Day': 5, 'Fight Club': 7,
             'The Silence of the Lambs': 5, 'Full Metal Jacket': 7,
             'Guardians of the Galaxy': 7, 'Men in Black': 5},
}
```

## Finding similar users

Before we can begin to think about recommending films, our first job is to find out which users have similar tastes to each other. This is the basis of our collaborative filtering: the assumption is that if Person A has scored a lot of films similarly to Person B, and Person A likes a film that Person B hasn’t seen, there is an above-average chance that Person B will also like that film.

There are several ways to go about calculating what you might consider to be a similar user. Although we are not going to use it in our final engine, because of drawbacks that we’ll discuss in a moment, it’s worth making a small diversion to discuss the Euclidean distance score.

## Euclidean Distance Score Function

This is a quick and dirty way to see how similar two users’ scores are. The Euclidean distance score takes the items that people have ranked in common, uses them as the axes of a chart, and plots the people in “preference space”.

For instance, let’s see how Jamie and Emily’s rankings of Forrest Gump and The Matrix look. Jamie rated Forrest Gump 7 and The Matrix 6, while Emily rated them 9 and 7 respectively.

To calculate the distance between Jamie and Emily on the chart, take the difference on each axis, square each difference and add them together. Then take the square root of the sum.

We can do this from Windows PowerShell by running Python and using **pow(n, 2)** to square a number and **sqrt** (from the `math` module) to take the square root.

```python
from math import sqrt
sqrt(pow(7-9, 2) + pow(6-7, 2))
>2.23606797749979
```

This gives us the distance between the two points, which will be a smaller number when there is a closer match. Ideally, we would like a function that gives higher values for people who are similar. This can be done by adding 1 to the distance (so we avoid dividing by zero) and inverting it:

```python
1/(1+sqrt(pow(7-9, 2)+pow(6-7, 2)))
>0.3090169943749474
```

#### Add Euclidean function to recommendations.py

Let’s add a function to our code to allow us to find the similarity score based on this working.

```python
from math import sqrt

# Returns a distance-based similarity score for person1 and person2
# (Euclidean distance score)
def sim_distance(prefs, person1, person2):
    # Get the list of shared items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1

    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0

    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[person1][item]-prefs[person2][item], 2)
                          for item in si])

    return 1/(1+sqrt(sum_of_squares))
```

We can now use this function with two names to get a similarity score. Running your shell from the directory of your source code, try the following:

```python
import recommendations
reload(recommendations)
recommendations.sim_distance(recommendations.critics, 'Mark', 'Tom')
>0.08121030314161228
```

While this can provide us with some interesting results, which we can compare with other methods later, there are some drawbacks to using Euclidean distance scoring, especially in cases where the data you’re using isn’t well normalised.

For instance, in this case our Euclidean distance score could give the impression that two people with very similar tastes in films are actually quite far apart depending on their own interpretation of the 1 to 10 system.

If Jamie thinks that a “great” film is worth about a 7 and a “bad” film is worth a 2, whereas Emily thinks a “great” film is worth a 10 and a “bad” film is worth a 4, then despite them feeling the same about the films they are rating, the Euclidean distance score will show them as far apart.

## Pearson Correlation Score

A method better suited to finding how close two people’s film tastes are, based on the data we have, is the Pearson correlation score (also known as the Pearson correlation coefficient). The correlation coefficient is a measure of how well two sets of data fit on a straight line.

This method, which produces a “best fit” line, allows us to correct for the grade inflation problem we just highlighted. Our Pearson coefficient will give a result between -1 and 1, with 1 being a “perfect” correlation, where the line intersects all points; 0 meaning there is no correlation between the data points; and -1 being a perfect negative correlation, meaning critic A hates everything that critic B loves.

#### Add Pearson function to recommendations.py

The formula we will be putting into our function is as follows:

r = (N∑xy - (∑x)(∑y)) / √[(N∑x² - (∑x)²)(N∑y² - (∑y)²)]

where:

N = number of pairs of scores

∑xy = sum of the products of paired scores

∑x = sum of x scores

∑y = sum of y scores

∑x² = sum of squared x scores

∑y² = sum of squared y scores

Following this formula, let’s create a new function:

```python
# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1

    # Find the number of elements
    n = float(len(si))

    # If they have no ratings in common, return 0
    if n == 0:
        return 0

    # Add up all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    # Sum up the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    # Sum up the products
    pSum = sum([prefs[p1][it]*prefs[p2][it] for it in si])

    # Calculate the Pearson score
    num = pSum-(sum1*sum2/n)
    den = sqrt((sum1Sq-pow(sum1, 2)/n)*(sum2Sq-pow(sum2, 2)/n))
    if den == 0:
        return 0

    r = num/den
    return r
```

With this function in place, we can now compare how similar two specified critics are:

```python
reload(recommendations)
recommendations.sim_pearson(recommendations.critics, 'Mark', 'Tom')
>0.3179866115511014
```
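To see how the Pearson score corrects for the grade inflation problem described earlier, here is a small self-contained sketch. The two critics and their ratings are made up (critic ‘B’ rates every film exactly 3 points higher than ‘A’), and the two similarity functions are restated so the snippet runs on its own:

```python
from math import sqrt

# Hypothetical critics: 'B' rates every film exactly 3 points higher
# than 'A', so their tastes are identical apart from grade inflation
prefs = {
    'A': {'Jaws': 7, 'Leon': 2, 'Aliens': 5},
    'B': {'Jaws': 10, 'Leon': 5, 'Aliens': 8},
}

def sim_distance(prefs, p1, p2):
    # Euclidean distance score over the films both critics rated
    shared = [f for f in prefs[p1] if f in prefs[p2]]
    if not shared:
        return 0
    sum_of_squares = sum(pow(prefs[p1][f]-prefs[p2][f], 2) for f in shared)
    return 1/(1+sqrt(sum_of_squares))

def sim_pearson(prefs, p1, p2):
    # Pearson correlation over the films both critics rated
    shared = [f for f in prefs[p1] if f in prefs[p2]]
    n = float(len(shared))
    if n == 0:
        return 0
    sum1 = sum(prefs[p1][f] for f in shared)
    sum2 = sum(prefs[p2][f] for f in shared)
    sum1Sq = sum(pow(prefs[p1][f], 2) for f in shared)
    sum2Sq = sum(pow(prefs[p2][f], 2) for f in shared)
    pSum = sum(prefs[p1][f]*prefs[p2][f] for f in shared)
    num = pSum-(sum1*sum2/n)
    den = sqrt((sum1Sq-pow(sum1, 2)/n)*(sum2Sq-pow(sum2, 2)/n))
    return 0 if den == 0 else num/den

print(round(sim_distance(prefs, 'A', 'B'), 3))  # 0.161 - looks dissimilar
print(round(sim_pearson(prefs, 'A', 'B'), 6))   # 1.0 - perfectly correlated
```

Despite the two critics agreeing completely on which films are better, the Euclidean score treats them as far apart, while the Pearson score recognises the perfect linear relationship.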

#### Add a ranking function to recommendations.py

Now that we can find out how similar two specific people are, the next step is to add the ability to compare one person to all the others.

Our function will use a Python *list comprehension* to compare the given person to every other user in the dictionary using one of our similarity metrics, and it will return the first *n* items of the sorted results.

```python
# Returns the best matches for person from the prefs dictionary
# Number of results and similarity function are optional params
def topMatches(prefs, person, n=10, similarity=sim_pearson):
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]

    # Sort the list so the highest scores appear at the top
    scores.sort()
    scores.reverse()
    return scores[0:n]
```

You can now try calling this function to get the top 3 matches for a person:

```python
reload(recommendations)
recommendations.topMatches(recommendations.critics, 'Mark', n=3)
>[(0.765253227940204, 'Shaun'), (0.539873454627286, 'Katy'), (0.4687546711276978, 'Richard')]
```

Even at this very early stage, we are getting some interesting data! I ran this function for all the users that took part and colour-coded the results, with negative correlations in red, ranging up to positive correlations in green.

With the dataset being so small, it is not uncommon for outliers to crop up and upset things. However, the value of this data increases over time and would allow you to spot trends occurring within these groups.

#### Add a recommendation function to recommendations.py

Our final step is adding a recommendation function that will hopefully help us pick films that a specified user will like. There are several ways we could accomplish this. Your first thought may be to simply look for the person with the closest correlation and recommend films they have seen that the specified user hasn’t, but this can cause problems.

It may be the case that your closest matched critic rated a film highly that was hated by almost everyone else, or that they haven’t reviewed some of the movies you might like. To solve these kinds of potential issues, you need to score items by producing a *weighted score*. This means modifying the ratings critics give films depending on how closely their tastes correlate with yours.

Let’s look at an example of how we would score two films for Mark: The Matrix & Forrest Gump

The table shows the correlation scores (*similarity*) and the 1-10 rating each of those critics gave the films (*The Matrix* and *Forrest Gump*). The *S.Matrix* and *S.Forrest Gump* columns show the weighted score, that is, the score after it has been multiplied by that critic’s similarity. This means that critics whose film tastes are closely related to yours are treated as more important, whereas critics with little correlation have the importance of their ratings dampened. As you can imagine, a critic with a negative correlation will actually lower a film’s ranking, and the higher the rating they give, the further it drags the result down.

The next step is to add up all of these weighted scores, while taking care of the issue that films with more reviews would otherwise score higher. For this reason, we take the sum of all the weighted scores and divide it by the sum of all the similarity scores (*Sim. Sum*). This produces our final film score.
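As a quick sketch of that arithmetic with made-up numbers: suppose just two critics have rated a film you haven’t seen, with similarities to you of 0.9 and 0.3, rating the film 8 and 4 respectively.

```python
# Hypothetical similarities of two critics to the target user,
# and their ratings of one film the target user hasn't seen
similarities = [0.9, 0.3]
ratings = [8, 4]

# Weighted score: each rating scaled by that critic's similarity
weighted_sum = sum(s * r for s, r in zip(similarities, ratings))  # 0.9*8 + 0.3*4

# Dividing by the similarity sum normalises for how many (and how
# similar) critics reviewed the film
predicted = weighted_sum / sum(similarities)  # 8.4 / 1.2
print(round(predicted, 2))  # 7.0
```

Notice the prediction lands much closer to the highly similar critic’s rating of 8 than to the dissimilar critic’s 4.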

We can now implement this weighted scoring into our recommendation engine, providing us with a list of movies we may like to watch next, ignoring ones we have already seen.

```python
# Gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs, person, similarity=sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare the person to themselves
        if other == person:
            continue
        sim = similarity(prefs, person, other)

        # Ignore scores of zero or lower
        if sim <= 0:
            continue

        for item in prefs[other]:
            # Only score movies the person hasn't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * score
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item]*sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalised list
    rankings = [(total/simSums[item], item) for item, total in totals.items()]

    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings
```

This code loops through every other person in the prefs dictionary. In each case, it calculates how similar they are to the specified person. It then loops through every item that person has scored and accumulates a weighted total for each unseen film. The scores are then normalised by dividing each of them by the similarity sum, and the sorted results are returned.

## Congratulations!

#### Test your recommendation engine

```python
reload(recommendations)
recommendations.getRecommendations(recommendations.critics, 'Emily')
>[(9.0, 'Terminator 2'), (9.0, 'Guardians of the Galaxy')]
```

As a cool side-effect, not only do we get a ranked list of movies, we also get a prediction of what that person would actually rate those movies. This information can be fed back into the system to make further improvements. For instance, you could specify that if a movie lined up for recommendation has a predicted rating of less than 5/10, it shouldn’t be shown.
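A minimal sketch of that threshold idea, assuming a list of (predicted score, film) tuples shaped like getRecommendations’ output (the scores and films here are made up):

```python
# Hypothetical output from getRecommendations: (predicted score, film) tuples
recs = [(9.0, 'Terminator 2'), (6.5, 'Jaws'), (4.2, 'Sin City')]

# Only surface films with a predicted rating of at least 5/10
worth_showing = [(score, film) for score, film in recs if score >= 5]
print(worth_showing)  # [(9.0, 'Terminator 2'), (6.5, 'Jaws')]
```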

This kind of recommendation engine can be adapted for many different types of data and similarity metrics. Experiment and enjoy!

You can find this project and many more in the great book Programming Collective Intelligence.