This is my first post on machine learning in R.
It follows Chapter 3 of Machine Learning in R by Brett Lantz. This book is an excellent introduction to machine learning and using it in R. It has lots of reproducible examples of the different methods of machine learning using freely available data that you can find on the internet.
What is machine learning?
Lantz defines machine learning as:
[t]he development of computer algorithms to transform data into intelligent action
He also identifies four components of learning in general, which apply to humans and machines alike:
- Data collection
- Abstraction
- Application (he uses the term ‘generalization’)
- Evaluation
Let’s briefly look at these in turn with a human example: judging someone’s English proficiency.
Data collection
Most people around the world come from countries where English is not their native language, yet many speak it as a second language. The average native English speaker will likely meet at least some people who speak English as a second language during their lives. How many depends on factors such as where our native speaker lives, who his friends are and what he does for a living.
Abstraction
Our native speaker begins to create a mental model of how good the English of someone from a non-English-speaking country is likely to be, ranging from non-existent to fluent. His model may include factors such as:
- Age
- Current location
- Nationality
- Occupation
- Education level
He is probably not consciously aware he is doing this most of the time.
Application
Our native speaker takes a holiday to Rome. He stops in a coffee shop and wonders whether the barista will speak English. His mental model tells him that she works in customer service, meaning it is likely she will have interacted in English with non-Italian tourists. She is also young, and in his experience younger people tend to have better English than older people thanks to higher exposure to American films and media. He also is in Rome, the capital of Italy, which will likely have a higher proportion of foreigners than a village in Tuscany. He decides on this basis it’s likely this woman will speak English, so he asks for a coffee in his native language.
Evaluation
As he sits down with his cappuccino, he ponders the conversation he has just had. The barista in fact spoke English very well, making conversation with him as his coffee brewed. His model served him well, which gives him confidence for when he looks for a place to have lunch.
Now let’s apply this model in R.
Machine learning in R
We are going to try to work out the positions of Premier League football players based on their statistics in the 2018/19 season.
Our hypothesis here is that certain statistics reveal something about the role a player has on the pitch. For example, a defender will likely make more tackles than a forward, but a forward will generally score more goals than a defender.
We are going to use nearest neighbour analysis. This method guesses the category an observation belongs to based on the categories and properties of its nearest neighbouring points.
For example, if you live on a street where the average income for all the other houses is £100,000 or more per year, the chances are you probably earn more than £100,000 a year as well, otherwise you probably wouldn’t be able to afford to live there.
For another example, running this code on the pre-loaded mtcars dataset produces this chart:
library(ggplot2)

mtcars$cyl[7] <- NA
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point()
If you had to guess, how many cylinders do you think the grey dot has: four, six or eight?
Most people would look at that chart and immediately (and correctly) guess eight cylinders because the grey dot is surrounded by blue eight cylinder dots, with no red dots anywhere near and the nearest green ones still some way off.
In other words you base your guess on the properties of the nearby dots – its neighbours.
This is very simple nearest neighbour analysis. R simply does this at a more advanced level with more variables.
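To make that concrete, here is a minimal sketch of the idea in base R. The points and labels are made up for illustration (weight and mpg for three cars, labelled by cylinder count); the class::knn function we use later does the same thing with more variables and more neighbours:

```r
# Toy nearest-neighbour classifier in base R (made-up points for illustration)
train <- matrix(c(2.5, 25,   # light car, high mpg  -> 4 cylinders
                  3.5, 18,   # mid-weight, mid mpg  -> 6 cylinders
                  5.0, 11),  # heavy car, low mpg   -> 8 cylinders
                ncol = 2, byrow = TRUE)
labels <- c(4, 6, 8)

new_point <- c(5.0, 12)      # our "grey dot": heavy, low mpg

# Euclidean distance from the new point to each training point
dists <- sqrt(rowSums(sweep(train, 2, new_point)^2))

labels[which.min(dists)]     # the label of the closest neighbour: 8
```

With k = 1 this just picks the single closest point; with a larger k it would take a vote among the k closest points instead.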
Here is the full code:
library(tidyverse)
library(class)
library(gmodels)
library(fplscrapR)

#get season data
data_2018_19 <- getplayerdetails1819

#get totals for season
data_2018_19_by_player <- data_2018_19 %>%
  group_by(playername) %>%
  summarise(minutes = sum(minutes),
            fouls = sum(fouls),
            errors = sum(errors_leading_to_goal),
            threat = sum(threat),
            goals_scored = sum(goals_scored),
            creativity = sum(creativity),
            cbi = sum(clearances_blocks_interceptions),
            assists = sum(assists),
            dribbles = sum(dribbles),
            key_passes = sum(key_passes),
            offside = sum(offside),
            open_play_crosses = sum(open_play_crosses),
            tackles = sum(tackles))

#add in player positions
players <- read.csv('players.csv', stringsAsFactors = FALSE)
player_data <- merge(players, data_2018_19_by_player,
                     by.x = 'Player', by.y = 'playername')

#remove goalkeepers
player_data <- player_data[player_data$Position != 'Goalkeeper',]
table(player_data$Position)

#remove players with fewer than 7 games
player_data <- player_data[player_data$minutes >= 630,]
player_data$minutes <- NULL

#make positions a factor
player_data$Position <- factor(player_data$Position,
                               levels = c('Defender','Midfielder','Forward'))

#randomize order
set.seed(4)
player_data <- player_data[sample(nrow(player_data)),]

#reset row.names
row.names(player_data) <- seq(1, nrow(player_data), 1)

#normalize function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

#apply normalize function
player_data_n <- as.data.frame(lapply(player_data[4:13], normalize))

#separate into training and test
player_data_train <- player_data_n[1:194,]
player_data_test <- player_data_n[195:nrow(player_data_n),]

#get positions
player_train_labels <- player_data[1:194, 2]
player_test_labels <- player_data[195:nrow(player_data), 2]

#work out k (13.9 in this case so 14)
round(sqrt(nrow(player_data_train)), digits = 0)

#run
player_test_pred <- class::knn(train = player_data_train,
                               test = player_data_test,
                               cl = player_train_labels,
                               k = 14)

#evaluate
gmodels::CrossTable(x = player_test_labels, y = player_test_pred,
                    prop.chisq = FALSE)

#look at outliers
player_test_names <- data.frame(name = player_data[195:nrow(player_data), 1],
                                position = player_data[195:nrow(player_data), 2])
player_test_names$predicted_position <- player_test_pred
player_test_names$correct <- player_test_pred == player_test_labels
Step 1: Gather the data
library(tidyverse)
library(class)
library(gmodels)
library(fplscrapR)

#get season data
data_2018_19 <- getplayerdetails1819

#get totals for season
data_2018_19_by_player <- data_2018_19 %>%
  group_by(playername) %>%
  summarise(minutes = sum(minutes),
            fouls = sum(fouls),
            errors = sum(errors_leading_to_goal),
            threat = sum(threat),
            goals_scored = sum(goals_scored),
            creativity = sum(creativity),
            cbi = sum(clearances_blocks_interceptions),
            assists = sum(assists),
            dribbles = sum(dribbles),
            key_passes = sum(key_passes),
            offside = sum(offside),
            open_play_crosses = sum(open_play_crosses),
            tackles = sum(tackles))

#add in player positions
players <- read.csv('players.csv', stringsAsFactors = FALSE)
player_data <- merge(players, data_2018_19_by_player,
                     by.x = 'Player', by.y = 'playername')
We are going to use the excellent fplscrapR package by Rasmus Christensen to get player performance data for 2018/19.
It has a pre-loaded data frame, which we assign to data_2018_19.
This data runs week by week for the 38 weeks of last season. We will summarise this data for the following variables for the entire season:
- minutes played
- fouls conceded
- errors leading to goals
- threat (on goal)
- goals scored
- creativity
- clearances, blocks and interceptions (CBI)
- assists
- dribbles
- offsides
- key passes
- open play crosses
- tackles
Unfortunately this data doesn’t have player positions, which is what we are trying to work out.
The best way I could find to get player positions was from the official Premier League website. I put the data in a Google spreadsheet that you can download here. I advise opening the document in OpenOffice or another program and saving it with UTF-8 encoding to preserve special characters in names such as Sergio Agüero.
We don’t have common player IDs unfortunately, so we will have to match players to positions by name. For the highest accuracy the names need to match exactly between the two datasets, accented characters and all – which is why the UTF-8 encoding matters. It’s not perfect, but it gives us a player_data dataset of 450 players.
Step 2: Clean the data
#remove goalkeepers
table(player_data$Position)
player_data <- player_data[player_data$Position != 'Goalkeeper',]
table(player_data$Position)

#remove players with fewer than 7 games
player_data <- player_data[player_data$minutes >= 630,]
player_data$minutes <- NULL

#make positions a factor
player_data$Position <- factor(player_data$Position,
                               levels = c('Defender','Midfielder','Forward'))
First of all we are going to remove goalkeepers from our dataset. There is a ‘saves’ field in the original data; as only goalkeepers make saves, any player with more than one save over the entire season is almost certainly a goalkeeper (unless they did a Kyle Walker and filled in as an emergency keeper). In our case, though, we already have positions from players.csv, so we can filter goalkeepers out directly on the Position column.
Secondly we are going to remove players who played fewer than 630 minutes (the equivalent of seven full matches) over the course of the season, to avoid bit-part players skewing the data. After filtering on minutes the variable is no longer needed, so we delete it.
Lastly we will make Position, the variable we want to predict, a factor.
Step 3: Randomise the order of the dataset
#randomize order
set.seed(4)
player_data <- player_data[sample(nrow(player_data)),]

#reset row.names
row.names(player_data) <- seq(1, nrow(player_data), 1)
The data is loaded in alphabetical order, but we need a random sample of players for our training and test sets later on. The set.seed function makes our ‘random’ sample reproducible: running the code again produces the same shuffle, which also means my results should match yours at the end. Without set.seed, a different sample of players would be picked each time.
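You can see the effect of set.seed with a quick experiment in base R (the seed value and vector here are just for illustration):

```r
set.seed(4)
sample(1:10)   # one shuffle of the numbers 1 to 10

set.seed(4)
sample(1:10)   # resetting the same seed reproduces exactly the same shuffle
```

This is why sharing the seed alongside the code is enough for a reader to reproduce the same train/test split.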
Step 4: Normalize
#normalize
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

#apply
player_data_n <- as.data.frame(lapply(player_data[4:13], normalize))
Lantz introduces the normalize function. This rescales a vector of numbers to between 0 and 1, mapping the minimum to 0 and the maximum to 1.
For example:
> normalize(c(1,2,3,4,5))
[1] 0.00 0.25 0.50 0.75 1.00
You can see here that the sequence 1 to 5 has been compressed proportionally into fractions between 0 and 1.
This is useful because our variables are on wildly different scales: some players have creativity ratings in excess of 1,000, but no player is ever going to score anything like 1,000 goals in a season. Without normalization, large-scale variables like creativity would dominate the distance calculations that nearest neighbour analysis relies on.
We apply this function across the dataset for all the numeric variables (columns 4 to 13), which leaves us with ten columns of data between 0 and 1 for our model to analyse.
Crucially we are removing the players’ names, positions and nationality. When running machine learning it is always a good idea to remove anything that can uniquely identify a record, such as an ID or a name. We are removing the nationality because it should have no bearing on the outcome and the position because that is what we are trying to model.
Step 5: Get our training and test datasets
#separate into training and test
player_data_train <- player_data_n[1:194,]
player_data_test <- player_data_n[195:nrow(player_data_n),]

#get positions
player_train_labels <- player_data[1:194, 2]
player_test_labels <- player_data[195:nrow(player_data), 2]
Machine learning generally runs on training and test datasets. You train your model on the majority (in our case 80%) of the data and then apply your model to the remaining 20%.
We have 243 players in our player_data_n dataset, so 80% is approximately 194. We have already randomised the rows so we can take the first 194 records as our training dataset and we will test on the remaining 49.
We also need the actual player positions separately so we can evaluate the accuracy of the model’s predictions; we save these as player_train_labels and player_test_labels.
Step 6: Run the algorithm
#work out k (13.9 so 14)
round(sqrt(nrow(player_data_train)), digits = 0)

#run
player_test_pred <- class::knn(train = player_data_train,
                               test = player_data_test,
                               cl = player_train_labels,
                               k = 14)
The variable k is the number of neighbours the algorithm will include.
Lantz suggests using approximately the square root of the number of observations in the training dataset; the square root of 194 is about 13.9, so we round to 14.
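The rule of thumb is only a starting point; in practice it is worth trying a few values of k and seeing how accuracy changes. Here is a quick sketch of that idea using R’s built-in iris dataset rather than our football data, with the same class::knn function (the split sizes and candidate k values are arbitrary choices for illustration):

```r
library(class)

# 80% of iris's 150 rows for training, the rest for testing
set.seed(4)
idx <- sample(nrow(iris), 120)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
train_labels <- iris$Species[idx]
test_labels  <- iris$Species[-idx]

round(sqrt(nrow(train)))   # the rule of thumb suggests k = 11

# Accuracy on the test set for a few candidate values of k
for (k in c(1, 5, 11, 21)) {
  pred <- knn(train, test, cl = train_labels, k = k)
  cat("k =", k, "accuracy:", round(mean(pred == test_labels), 3), "\n")
}
```

A k that is too small makes the model sensitive to noisy individual points; a k that is too large drags in points from other classes, so a moderate value usually works best.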
The code here is running our model. It’s evaluating all our footballing metrics in player_data_train to try to guess the positions of the players in our player_data_test dataset.
Step 7: Evaluate
The final step is to evaluate our model. Here is a simplified version of our crosstab, with actual positions as rows and predicted positions as columns:

            Defender  Midfielder  Forward  Total
Defender          20           2        0     22
Midfielder         3          16        0     19
Forward            0           3        5      8
The top row shows that there were 22 defenders in our test data. Twenty of them are in the top left Defender square, meaning our model correctly identified 20 out of 22 defenders, a 90.9% success rate. It incorrectly classified two defenders as midfielders.
Our model also did well at predicting midfielders, with 16 out of 19 guessed correctly and three incorrectly classed as defenders – an 84.2% success rate.
However with forwards this dropped to a 62.5% rate, with five out of eight correct.
Overall our model correctly assigned 41 out of 49 players to their positions!
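These success rates come straight from the crosstab counts; as a sanity check, you can reproduce them in a couple of lines (the counts below are the ones quoted above):

```r
# Correct predictions and totals per position, taken from the crosstab
correct <- c(Defender = 20, Midfielder = 16, Forward = 5)
totals  <- c(Defender = 22, Midfielder = 19, Forward = 8)

round(100 * correct / totals, 1)            # 90.9, 84.2, 62.5
round(100 * sum(correct) / sum(totals), 1)  # 83.7 overall
```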
Bonus step: Identify outliers
#look at outliers
player_test_names <- data.frame(name = player_data[195:nrow(player_data), 1],
                                position = player_data[195:nrow(player_data), 2])
player_test_names$predicted_position <- player_test_pred
player_test_names$correct <- player_test_pred == player_test_labels
We can compare our test results against the actual test data to see which ones were incorrectly predicted. Here is what happens:
> player_test_names[player_test_names$correct == FALSE,]
               name   position predicted_position correct
11     Hélder Costa    Forward         Midfielder   FALSE
19    Andrew Surman Midfielder           Defender   FALSE
20 Matteo Guendouzi Midfielder           Defender   FALSE
26   Ivan Cavaleiro    Forward         Midfielder   FALSE
37     Romain Saïss Midfielder           Defender   FALSE
46  Xherdan Shaqiri    Forward         Midfielder   FALSE
47     José Holebas   Defender         Midfielder   FALSE
49      Lucas Digne   Defender         Midfielder   FALSE
Interestingly our model is good enough that no defenders were classed as forwards or vice versa. It’s only with midfielders that errors creep in.
This makes sense, because players who perform very different roles for their teams are all classed as midfielders.
To take Leicester City as an example: both James Maddison and Wilfred Ndidi are classed as midfielders. Maddison is a creative number 10, tasked with creating chances for his teammates, which he does from open play and set pieces. Ndidi on the other hand acts as a defensive shield in front of the back four, just as N’Golo Kanté did so famously when the Foxes won the title in 2015/16. We would therefore expect Ndidi to register more tackles than Maddison, but Maddison to register higher creativity and more assists.
We can see that Matteo Guendouzi and Andrew Surman, both defensively-minded midfielders, are incorrectly classed as defenders for this reason.
On the other hand, anyone who watches Everton will know Lucas Digne is an extremely attacking full-back. He also takes corners and free-kicks for the Toffees, increasing his creativity and threat tallies beyond what you might expect from a defender, leading to him being incorrectly classed as a midfielder.
Conclusion
Nearest neighbour analysis is a relatively simple but effective machine learning method.
Our method was able to predict footballers’ playing positions with an 83.7% success rate.
It can be used for extremely important tasks – the example in Lantz’s book involves working out whether breast cancer tumours are cancerous or benign.
The impact of false negatives or false positives from a nearest neighbour analysis depends on what exactly you are trying to learn.
Our football example is relatively light-hearted. But if you were using nearest neighbour analysis as the sole basis for detecting cancer, false negatives would mean at least some patients being told they don’t have cancer when in fact they do, and false positives would mean patients being given the dreadful news that they have cancer when in fact they don’t. Clearly the consequences of either would be extremely serious, and in the first case often fatal.
Next steps would be to tweak our model with other variables or to retest it – perhaps using another season of data, or trying again with a different seed (i.e. a different test group of players) – to see how well it performs. It’s possible that our 83.7% success rate is itself a fluke outlier; the way to gain a clearer picture is to try again and see whether we can repeat or even improve our score.