On Kaggle there is a Netflix Movies and TV Shows dataset available to download and work with.
This dataset contains almost 8,000 TV shows and films available on the streaming service. It includes information about the genre, cast, director, country and a summary of what the show is about.
I set myself one of the tasks associated with this dataset: to create a recommendation system.
The idea is that you name a show and the app recommends some similar TV series and films to watch.
The code for this work is on GitHub.
Introducing the data
First download the data from the URL above and read it.
library(tidyverse)
library(tidytext)
library(shiny)
netflix <- read_csv('netflix_titles.csv')
str(netflix)
tibble [7,787 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ show_id : chr [1:7787] "s1" "s2" "s3" "s4" …
$ type : chr [1:7787] "TV Show" "Movie" "Movie" "Movie" …
$ title : chr [1:7787] "3%" "7:19" "23:59" "9" …
$ director : chr [1:7787] NA "Jorge Michel Grau" "Gilbert Chan" "Shane Acker" …
$ cast : chr [1:7787] "João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Me"| truncated "Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato" "Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim" "Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasc"| truncated …
$ country : chr [1:7787] "Brazil" "Mexico" "Singapore" "United States" …
$ date_added : chr [1:7787] "August 14, 2020" "December 23, 2016" "December 20, 2018" "November 16, 2017" …
$ release_year: num [1:7787] 2020 2016 2011 2009 2008 …
$ rating : chr [1:7787] "TV-MA" "TV-MA" "R" "PG-13" …
$ duration : chr [1:7787] "4 Seasons" "93 min" "78 min" "80 min" …
$ listed_in : chr [1:7787] "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy" "Dramas, International Movies" "Horror Movies, International Movies" "Action & Adventure, Independent Movies, Sci-Fi & Fantasy" …
$ description : chr [1:7787] "In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance to join t"| truncated "After a devastating earthquake hits Mexico City, trapped survivors from all walks of life wait to be rescued wh"| truncated "When an army recruit is found dead, his fellow soldiers are forced to confront a terrifying secret that's haunt"| truncated "In a postapocalyptic world, rag-doll robots hide in fear from dangerous machines out to exterminate them, until"| truncated …
attr(*, "spec")=
.. cols(
.. show_id = col_character(),
.. type = col_character(),
.. title = col_character(),
.. director = col_character(),
.. cast = col_character(),
.. country = col_character(),
.. date_added = col_character(),
.. release_year = col_double(),
.. rating = col_character(),
.. duration = col_character(),
.. listed_in = col_character(),
.. description = col_character()
.. )
Inspecting the data more closely, we can see there are some NAs in some columns. Some variables are character strings that hold more than one element, separated by commas:
netflix$cast[1]
[1] "João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Frateschi"
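Both issues are easy to check on a small scale. The sketch below uses a made-up `toy` data frame (its values are placeholders, not real rows); on the real data the same calls would be pointed at `netflix`.

```r
# Made-up miniature of the Netflix data (hypothetical values, not real rows)
toy <- data.frame(
  show_id  = c("s1", "s2", "s3"),
  director = c(NA, "Director A", "Director B"),
  cast     = c("Actor A, Actor B, Actor C", "Actor D", NA),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Split one comma-separated entry into its individual elements
first_cast <- strsplit(toy$cast[1], ", ", fixed = TRUE)[[1]]
print(first_cast)
```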
Principles of the recommendation engine
I’m going to feed the following variables into my engine to produce my recommendations:
- Director
- Cast
- Country
- Genre
- Description
An important point here is that we will need to separate these multi-element strings out where necessary. We need all cast members, countries, words of the description and so on to make the most educated guesses possible.
Whenever an entry holds more than one element, each element must get its own row in the data frame.
These functions will enable us to do that:
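#define functions
#split a comma-separated value into one row per element and
#append the rows to the global spread_df
spread_netflix <- function(show_id, var) {
  a <- data.frame(show_id = show_id,
                  var = str_split(as.character(var),
                                  pattern = ', '))
  names(a) <- c('show_id','var')
  #replace missing values with a blank string
  a[is.na(a$var),2] <- ''
  spread_df <<- rbind(spread_df, a)
  #return nothing so mapply() does not keep a copy of spread_df for every show
  invisible(NULL)
}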
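desc <- data.frame()
#split a description into one row per word, clean the words and
#append the rows to the global desc
spread_desc <- function(show_id, var) {
  a <- data.frame(show_id = show_id,
                  desc = str_split(as.character(var), pattern = ' '))
  names(a) <- c('show_id','word')
  #strip punctuation and convert to lower case
  a$word <- gsub('[\\,.;:!?"]','',a$word)
  a$word <- a$word %>% str_to_lower()
  #drop common words using the stop_words data from tidytext
  a <- anti_join(a, stop_words, by= 'word')
  a$len <- str_length(a$word)
  desc <<- rbind(desc, a)
  #return nothing so mapply() does not keep a copy of desc for every show
  invisible(NULL)
}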
These functions take var and split it, using a comma as the separator or, in the case of spread_desc(), a space.
They use str_split() from the tidyverse to break each value into its elements, giving one row per element, and then append those rows to a data frame that lives outside the function.
The spread_desc() function does some extra cleaning of the description data before sending it to the long data frame for processing. It removes punctuation, converts the words to lower case and uses the tidytext package to remove stop words. These are common words such as ‘from’ and ‘to’ that don’t provide us with any useful information about the show.
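To make the cleaning steps concrete, here is a minimal base R sketch that applies them to the start of the first description in the dataset. The `my_stop_words` vector is a tiny hypothetical stand-in for the much longer `stop_words` table from tidytext, so the result only approximates what spread_desc() would keep.

```r
# Start of the first description in the dataset
description <- "In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance"

# Tiny hypothetical stop-word list standing in for tidytext::stop_words
my_stop_words <- c("in", "a", "where", "the", "an", "far", "from",
                   "you", "get", "one", "to")

words <- strsplit(description, " ", fixed = TRUE)[[1]]  # split on spaces
words <- gsub('[,.;:!?"]', "", words)                    # strip punctuation
words <- tolower(words)                                  # convert to lower case
words <- words[!words %in% my_stop_words]                # drop stop words
print(words)
```

What remains are the words that actually say something about the show, which is what the matching step later compares.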
Get the cast
#get cast and main characters
spread_df <- data.frame()
invisible(mapply(netflix$show_id, FUN = spread_netflix, var = netflix$cast))
cast_members <- spread_df
#number the cast of each show in the order they are listed
cast_members$seq <- ave(seq_along(cast_members$var), cast_members$show_id, FUN = seq_along)
cast_members$type <- 'cast_member'
#the first-named cast member of each show is treated as its main character
main_characters <- cast_members[cast_members$seq == 1,]
main_characters <- main_characters[main_characters$var != '',]
This code applies the spread_netflix() function we just created, passing in the cast column as the variable.
It splits off the main characters (the first-named cast member of each show) from the rest of the cast, for reasons to which we will return later.
> head(cast_members)
show_id var seq type
1 s1 João Miguel 1 cast_member
2 s1 Bianca Comparato 2 cast_member
3 s1 Michel Gomes 3 cast_member
4 s1 Rodolfo Valente 4 cast_member
5 s1 Vaneza Oliveira 5 cast_member
6 s1 Rafael Lozano 6 cast_member
Now we have each cast member from the first show as their own row in the data frame. We have identified them as a cast_member and preserved the link to the show through the show_id.
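The same long layout can also be produced without calling a function once per show. The following is only a sketch of that alternative in base R, run on a made-up `toy` data frame; the name `cast_long` is hypothetical and not part of the code above.

```r
# Made-up miniature of the cast column (hypothetical values)
toy <- data.frame(
  show_id = c("s1", "s2", "s3"),
  cast    = c("Actor A, Actor B, Actor C", "Actor D", NA),
  stringsAsFactors = FALSE
)

# Split every cast string at once; a missing cast stays a single NA element
parts <- strsplit(toy$cast, ", ", fixed = TRUE)

# Repeat each show_id once per cast member and flatten the list
cast_long <- data.frame(
  show_id = rep(toy$show_id, lengths(parts)),
  var     = unlist(parts),
  stringsAsFactors = FALSE
)
cast_long$var[is.na(cast_long$var)] <- ""  # same blank placeholder as spread_netflix()

# Position within each show: 1 is the first-named cast member
cast_long$seq <- ave(seq_along(cast_long$var), cast_long$show_id, FUN = seq_along)
print(cast_long)
```

Because the split happens in one vectorised call, nothing has to be appended to a global data frame along the way.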
#get countries
spread_df <- data.frame()
mapply(netflix$show_id, FUN = spread_netflix, var = netflix$country)
countries <- spread_df
countries$type <- 'country'
#get genres
spread_df <- data.frame()
mapply(netflix$show_id, FUN = spread_netflix, var = netflix$listed_in)
genres <- spread_df
genres$type <- 'genres'
#get directors
spread_df <- data.frame()
mapply(netflix$show_id, FUN = spread_netflix, var = netflix$director)
directors <- spread_df
directors$type <- 'director'
#get desc
desc <- data.frame()
mapply(netflix$show_id[1:3000], FUN = spread_desc, var = netflix$description[1:3000])
desc$type <- 'desc'
desc_all <- desc
desc <- data.frame()
mapply(netflix$show_id[3001:6000], FUN = spread_desc, var = netflix$description[3001:6000])
desc$type <- 'desc'
desc_all <- rbind(desc_all, desc)
desc <- data.frame()
mapply(netflix$show_id[6001:nrow(netflix)], FUN = spread_desc, var = netflix$description[6001:nrow(netflix)])
desc$type <- 'desc'
desc_all <- rbind(desc_all, desc)
names(desc_all)[2] <- 'var'
desc_all <- desc_all[desc_all$var != '–',]
This code gets the rest of the variable data and puts it into separate data frames. I split up the execution of the code for the description because it was failing on my machine due to a lack of memory.
Having run all this we have long data frames containing all the data we need to make our recommendations.
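If memory is tight, one pattern worth trying is to collect the small per-show data frames in a list and bind them together once at the end, rather than growing a data frame with rbind() on every call. This is a sketch on made-up values, and `split_one` is a hypothetical helper rather than one of the functions defined earlier.

```r
# Made-up inputs (hypothetical values)
ids   <- c("s1", "s2", "s3")
casts <- c("Actor A, Actor B", "Actor C", "Actor D, Actor E, Actor F")

# Hypothetical helper: returns a small data frame for a single show
split_one <- function(show_id, var) {
  data.frame(show_id = show_id,
             var = strsplit(var, ", ", fixed = TRUE)[[1]],
             stringsAsFactors = FALSE)
}

# Collect the pieces in a list (SIMPLIFY = FALSE keeps them as data frames)
pieces <- mapply(split_one, ids, casts, SIMPLIFY = FALSE, USE.NAMES = FALSE)

# Bind all the pieces together in a single step
long_df <- do.call(rbind, pieces)
print(long_df)
```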
Identify matching shows
The next step is to write a function that matches a selection with the relevant shows.
get_matches <- function(show_id, df, boost) {
  #values (cast, genres, words and so on) belonging to the chosen show
  my_selection <- df[df$show_id == show_id,]
  my_selection_chr <- my_selection$var
  matching_titles <- subset(df, var %in% my_selection_chr)
  #remove the option the user chose
  matching_titles <- matching_titles[matching_titles$show_id != show_id,]
  #remove blank options
  matching_titles <- matching_titles[matching_titles$var != '',]
  #award a score based on how closely it matches
  matching_titles %>%
    group_by(show_id, var, type) %>%
    summarise(count = n() + boost, .groups = 'keep')
}
This function takes three arguments: show_id, df and boost.
The show_id is the ID of the show that we have chosen. The df is one of the data frames we have created in the steps above. The boost is an extra score to be given to variables that I judge to be more important than others.
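To see what the matching and the boost do to a single data frame, here is a base R sketch of the same logic on a made-up `directors` table. It mirrors the steps of get_matches() but is not a replacement for it, and all the values are placeholders.

```r
# Made-up long data frame of directors (hypothetical values)
directors <- data.frame(
  show_id = c("s1", "s2", "s3", "s4"),
  var     = c("Director A", "Director A", "Director B", ""),
  type    = "director",
  stringsAsFactors = FALSE
)

chosen <- "s1"
boost  <- 5

# Values belonging to the chosen show
wanted <- directors$var[directors$show_id == chosen]

# Other shows that share one of those values, ignoring blanks
hits <- directors[directors$var %in% wanted &
                    directors$show_id != chosen &
                    directors$var != "", ]

# One match plus the boost gives the score for each hit
hits$count <- 1 + boost
print(hits)
```

Only s2 shares a director with s1, so it is the only hit and it scores 1 + 5 = 6.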
Explaining the boost
The get_matches() function contains the boost argument because a match on some variables is a stronger sign of a good recommendation than a match on others.
To demonstrate this, let’s take two films in the dataset: Ocean’s Twelve and Ocean’s Thirteen. If you enjoy Twelve, then logically you are highly likely to enjoy Thirteen and vice versa. Therefore we want our engine to recommend one based on the other.
Both films are listed as American and come under the ‘Action & Adventure’ and ‘Comedies’ genres. There are 58 shows in the dataset that meet these criteria. Not bad, but there are some films in there that are nothing like our targets (You Don’t Mess With the Zohan, anyone?)
However, if we filter by director (Steven Soderbergh) and main character (George Clooney) we get just the two Ocean’s films.
Therefore we can reasonably conclude that a matching director and main character are strong predictors of a similar show to the viewer’s tastes.
Putting the boost in as an argument means we can tweak exactly how important each variable is.
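The effect of the weighting can be sketched with a few made-up films. The `films` table and its values are hypothetical, and the `boost` vector uses the weights chosen later in this post (zero for country and genre, five for director and main character); each matching variable scores one point plus its boost.

```r
# Made-up films (hypothetical values); Film A is the one the viewer chose
films <- data.frame(
  title    = c("Film A", "Film B", "Film C", "Film D"),
  country  = c("United States", "United States", "United States", "United States"),
  genre    = c("Comedies", "Comedies", "Comedies", "Comedies"),
  director = c("Director X", "Director X", "Director Y", "Director Z"),
  lead     = c("Actor P", "Actor P", "Actor Q", "Actor P"),
  stringsAsFactors = FALSE
)

# Broad filter: country and genre alone match every film
broad <- films[films$country == "United States" & films$genre == "Comedies", ]

# Narrow filter: same director and same lead actor as Film A
narrow <- films[films$director == "Director X" & films$lead == "Actor P", ]

# Score each film against Film A: one point per matching variable, plus its boost
boost <- c(country = 0, genre = 0, director = 5, lead = 5)
ref <- films[1, ]
score <- sapply(seq_len(nrow(films)), function(i) {
  matched <- sapply(names(boost), function(f) films[[f]][i] == ref[[f]])
  sum((1 + boost)[matched])
})
names(score) <- films$title
print(score[-1])  # leave out Film A itself
```

Film B, which shares both the director and the lead, ends up well ahead of Film C, which only shares the country and genre.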
Suggest some matching titles
suggest_titles <- function(title) {
  #look up the ID of the chosen show (the first match if a title is duplicated)
  show_id <- netflix$show_id[which(netflix$title == title)[1]]
  match_countries <- get_matches(show_id, df = countries, boost = 0)
  #assign a boost if the show is not American
  match_countries$count <- ifelse(match_countries$var == 'United States', yes = match_countries$count, no = match_countries$count + 2)
  match_desc <- get_matches(show_id, df = desc_all, boost = 2)
  match_genres <- get_matches(show_id, df = genres, boost = 0)
  match_directors <- get_matches(show_id, df = directors, boost = 5)
  match_main_characters <- get_matches(show_id, df = main_characters, boost = 5)
  match_cast <- get_matches(show_id, df = cast_members, boost = 2)
  #drop cast matches for shows that already matched on the main character
  match_cast <- anti_join(match_cast, match_main_characters, by = 'show_id')
  all_titles <- rbind(match_countries,
                      match_genres,
                      match_directors,
                      match_main_characters,
                      match_cast,
                      match_desc)
  #sum the scores for each show and keep the five highest
  suggested_titles <- all_titles %>%
    group_by(show_id) %>%
    summarise(tally = sum(count), .groups = 'drop') %>%
    arrange(desc(tally))
  top_picks <- head(suggested_titles, 5)
  #attach the titles; merge() re-sorts by show_id, so order by tally again
  top_picks <- merge(top_picks, netflix, by = 'show_id') %>% arrange(desc(tally))
  top_picks_df <- data.frame(title = paste0(top_picks$title,'\n'),
                             rank = paste0(seq_len(nrow(top_picks)),'.'))
  top_picks_df$both <- paste(top_picks_df$rank, top_picks_df$title, sep = ' ')
  writeLines(paste0("You chose '",title,"'.\nBased on your choice we recommend: \n\n"))
  writeLines(top_picks_df$both)
}
The final function takes the user’s choice, filters all the data frames for matches, assigns a score to each and sums the tallies to produce a final score.
I gave a matching director or a matching main character a top boost of five. A matching country gets a boost of two if that country is not the United States. This isn’t because I dislike American films (I don’t) but because 42 per cent of the shows are wholly or partly American.
Knowing that the show is American doesn’t tell you an awful lot. However, if your selection is one of the 88 shows from Brazil, a matching country is much more informative.
The top matches are then written to the console!
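The final tallying step can be sketched in base R on a made-up table of matches. The `all_matches` values are hypothetical, and only the top two shows are kept here to keep the output short.

```r
# Made-up match scores from several variables (hypothetical values)
all_matches <- data.frame(
  show_id = c("s2", "s2", "s2", "s3", "s4", "s4"),
  type    = c("director", "genres", "country", "genres", "country", "genres"),
  count   = c(6, 1, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Sum the scores for each show
tally <- aggregate(count ~ show_id, data = all_matches, FUN = sum)
names(tally)[2] <- "tally"

# Highest totals first, then keep at most the top two
tally <- tally[order(-tally$tally), ]
top_picks <- head(tally, 2)
print(top_picks)
```

Using head() rather than indexing a fixed range of rows means a show with fewer matches than requested does not produce empty rows.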
Testing it out
Let’s try it a few times:
> suggest_titles(title = "Casino Royale")
You chose 'Casino Royale'.
Based on your choice we recommend:
1. GoldenEye
2. Quantum of Solace
3. Tomorrow Never Dies
4. Die Another Day
5. Doom
This looks like a great recommendation. If you like Daniel Craig’s debut as Bond, then there are four more Bond films recommended to watch.
> suggest_titles(title = "Indiana Jones and the Temple of Doom")
You chose 'Indiana Jones and the Temple of Doom'.
Based on your choice we recommend:
1. Indiana Jones and the Last Crusade
2. Indiana Jones and the Raiders of the Lost Ark
3. Indiana Jones and the Kingdom of the Crystal Skull
4. Hook
5. K-19: The Widowmaker
Again, the engine looks to be working here. It recommends the other Indiana Jones films if you choose Temple of Doom.
Let’s try a TV show instead:
> suggest_titles('Greatest Events of WWII in Colour')
You chose 'Greatest Events of WWII in Colour'.
Based on your choice we recommend:
1. World War II in Colour
2. WWII in HD
3. A Bridge Too Far
4. The 12th Man
5. Churchill’s Secret Agents: The New Recruits
This World War II selection recommends a number of other shows to do with the Second World War as well as A Bridge Too Far, a WW2 dramatisation.
Discussion
I wrapped the code in a Shiny app and embedded it at the beginning of this post.
I went about this task slightly differently from how other data scientists might have done it. A more conventional way to build a recommendation engine is to use a machine learning technique such as K-means clustering.
My approach is not dissimilar, but I wouldn’t count it as machine learning as such because it doesn’t ‘learn’. I did the learning and the tweaking of the algorithm myself, rather than leaving it to the machine.
However, it seems to work well and that is what matters.