One of the most detailed UK government datasets I’ve ever seen is the STATS19 data from the Department for Transport.
This is the collated reports of road accidents in Great Britain in which at least one person was injured or killed.
It records all kinds of details, from the basics such as the date, time and location of the accident to how many vehicles were involved, what kinds of vehicles they were, the number of casualties, weather conditions and even whether the vehicle was left or right hand drive.
I’ve done some analysis on the 4,500+ injuries from accidents since 2010 that featured dodgy road signs or markings, which you can read here in the Bath Chronicle. It builds on work done before by my former colleague Patrick Scott.
Here is a flavour of my story:
Dodgy signs, traffic lights and road markings in Bath have been a factor in 21 people being injured since 2010.
New analysis of road-traffic data has revealed that one person was injured in accidents where there was a broken road sign or marking, or one that couldn’t be seen properly.
There were another 20 injuries that involved traffic lights and other automatic road signs that weren’t working properly, or at all.
This made 21 casualties in Bath and North East Somerset between 2010 and 2015, exclusive Chronicle analysis of the Department for Transport’s road accidents data shows.
In Part One of this post, I’ll run through how I pulled together the data from 12 spreadsheets covering 2010 to 2015.
The first thing to do is download and unzip all the files. Each year is split into three spreadsheets: accidents, casualties and vehicles. I didn’t include the vehicles in my analysis. They are all linked by a unique Accident Index identifier.
accidents2015 <- read.csv("Accidents_2015.csv")
casualties2015 <- read.csv("Casualties_2015.csv")
accidents2014 <- read.csv("DfTRoadSafety_Accidents_2014.csv")
casualties2014 <- read.csv("DfTRoadSafety_Casualties_2014.csv")
accidents2013 <- read.csv("DfTRoadSafety_Accidents_2013.csv")
casualties2013 <- read.csv("DfTRoadSafety_Casualties_2013.csv")
accidents2012 <- read.csv("DfTRoadSafety_Accidents_2012.csv")
casualties2012 <- read.csv("DfTRoadSafety_Casualties_2012.csv")
accidents2011 <- read.csv("DfTRoadSafety_Accidents_2011.csv")
casualties2011 <- read.csv("DfTRoadSafety_Casualties_2011.csv")
accidents2010 <- read.csv("DfTRoadSafety_Accidents_2010.csv")
casualties2010 <- read.csv("DfTRoadSafety_Casualties_2010.csv")
This Accident Index is the key to my merge function.
I’m aware there is a bit of repetition in this code. I’ve read here about the Reduce function, which may be helpful for combining more than two datasets. I’ll stick with this method for now because the data frames need further cleaning before we can combine them into one massive data frame covering 2010 to 2015.
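For reference, here is a minimal sketch of what that Reduce approach could look like once the yearly data frames (built below) share identical columns; it is an illustration rather than part of the workflow I actually used:
# Hypothetical: stack a list of identically-structured data frames in one call
all_years <- list(y2010, y2011, y2012, y2013, y2014, y2015)
y201015 <- Reduce(rbind, all_years)
For now, here are the year-by-year merges: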
y2010 <- merge(x = accidents2010, y = casualties2010, by.x = 'Accident_Index', by.y = 'Acc_Index')
y2011 <- merge(x = accidents2011, y = casualties2011, by.x = 'Accident_Index', by.y = 'Acc_Index')
y2012 <- merge(x = accidents2012, y = casualties2012, by.x = 'Accident_Index', by.y = 'Acc_Index')
y2013 <- merge(x = accidents2013, y = casualties2013, by.x = 'Accident_Index', by.y = 'Acc_Index')
y2014 <- merge(x = accidents2014, y = casualties2014, by.x = 'ï..Accident_Index', by.y = 'ï..Accident_Index')
y2015 <- merge(x = accidents2015, y = casualties2015, by.x = 'Accident_Index', by.y = 'Accident_Index')
For some reason the 2014 Accident Index column name is different from the others. We can use the following line of code to put it in line:
names(y2014)[names(y2014)=="ï..Accident_Index"] <- "Accident_Index"
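Those stray characters at the start of the 2014 column name are what read.csv produces when a file begins with a UTF-8 byte order mark. Assuming that is the cause here, you could alternatively strip it at read time:
accidents2014 <- read.csv("DfTRoadSafety_Accidents_2014.csv", fileEncoding = "UTF-8-BOM")
casualties2014 <- read.csv("DfTRoadSafety_Casualties_2014.csv", fileEncoding = "UTF-8-BOM")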
Something else is not right here. Looking at our new data frames in RStudio, there’s a problem:
Years 2010 to 2013 have 45 variables, whereas 2014 has 46 and 2015 has 47.
This will be no good for using rbind later on because the columns need to match up.
Calling str on y2010 and y2014 reveals that in 2014 the DfT introduced the precise age of the casualty as a new variable. Before then we just had the age band of the casualty. You knew, for example, that a casualty was between 36 and 45 but not exactly how old they were. Everything else is in the same order.
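A quick way to spot a difference like this without wading through the str output is to compare the column names directly; something like:
# Columns in the 2014 merge that the 2010 merge doesn't have
setdiff(names(y2014), names(y2010))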
To get around this problem, we can add in a blank column for the age of the casualty for the previous years using the append function:
y2010 <- as.data.frame(append(y2010, list(Age_of_Casualty = NA), after = 36))
y2011 <- as.data.frame(append(y2011, list(Age_of_Casualty = NA), after = 36))
y2012 <- as.data.frame(append(y2012, list(Age_of_Casualty = NA), after = 36))
y2013 <- as.data.frame(append(y2013, list(Age_of_Casualty = NA), after = 36))
We are telling R here: ‘Add a new blank column called Age_of_Casualty as the 37th column in our data frame.’ This column will just act as filler.
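It is worth checking that the filler column really did land in the right place; for example:
which(names(y2010) == "Age_of_Casualty")  # should return 37
names(y2010)[36:38]                       # the new column and its neighbours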
There’s still one more problem:
If you remember, our data frame y2015 has another extra variable. This is actually a very interesting one called ‘Casualty_IMD_Decile’.
This tells us how deprived an area the casualty (i.e. the injured or killed person) came from.
This is the final column in our data frame, so it’ll be a lot easier to add.
First of all, let’s get the other data frames together
We can use the rbind function on the years 2010 to 2014 to make them one data frame. Once that’s done, we can save ourselves some time by adding another blank column, but only doing it once.
Rbind works by joining datasets vertically, i.e. putting one underneath another. To make it work properly, you want to have the same column names in the same order with the same types of data in each column.
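If you want reassurance before running rbind, an optional sanity check that the column names now line up across the years might look like this:
# TRUE if every yearly frame has exactly the same column names, in the same order
all(sapply(list(y2011, y2012, y2013, y2014),
           function(df) identical(names(df), names(y2010))))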
We now have that for the years 2010 to 2014, so let’s combine them:
y201014 <- rbind(y2010,y2011,y2012,y2013,y2014)
Now let’s add that final column and join 2015:
We need a blank final column for our new data frame. Because it’s at the end, we can do it simply like this:
y201014$Casualty_IMD_Decile <- NA

# combine all years
y201015 <- rbind(y201014, y2015)
We now have our data in one huge data frame!
All 1,172,657 rows by 47 columns of it.
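If you want to confirm the size for yourself:
dim(y201015)  # number of rows, then number of columns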
Now let’s take the data we are interested in.
The variable we want is called ‘Special_Conditions_at_Site’. This covers a variety of things that could be wrong with the road on which the accident took place:
| Code | Label |
|------|-------|
| 0 | None |
| 1 | Auto traffic signal – out |
| 2 | Auto signal part defective |
| 3 | Road sign or marking defective or obscured |
| 4 | Roadworks |
| 5 | Road surface defective |
| 6 | Oil or diesel |
| 7 | Mud |
| -1 | Data missing or out of range |
The data uses the numeric code to save writing the label out in full. We are interested in codes one to three:
defective <- y201015[y201015$Special_Conditions_at_Site>=1&y201015$Special_Conditions_at_Site<=3, ]
Here we are saying: ‘Give me all the rows where Special_Conditions_at_Site is greater than or equal to one and less than or equal to three’.
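Before moving on, it can be worth eyeballing how the rows split across the three codes we kept; for instance:
# Count of rows for each defect code (1, 2 and 3)
table(defective$Special_Conditions_at_Site)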
Now we are pretty much done!
We have a data frame called defective of 4,672 rows.
Each row corresponds to an injury. Merge will create a new row if there are duplicates in the joining columns. In our case, one accident (and ID) could have several casualties, each of which would have its own row.
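One way to see that duplication in the data:
# Rows that are the second or later casualty from the same accident
sum(duplicated(defective$Accident_Index))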
We could use merge again to add in the local authorities of the accidents and to sub in the meanings of the special conditions.
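For what it’s worth, subbing in the meanings would just mean building a small lookup table from the code list above and merging on the code column; a sketch (the lookup data frame here is made up for illustration):
# Hypothetical lookup of the three codes we kept and their labels
condition_labels <- data.frame(
  Special_Conditions_at_Site = 1:3,
  Condition_Label = c("Auto traffic signal - out",
                      "Auto signal part defective",
                      "Road sign or marking defective or obscured"))
defective <- merge(defective, condition_labels, by = "Special_Conditions_at_Site")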
But I’d rather not. Instead we can print off our new, more manageable table as a CSV and start analysing it in Part Two. We can do it just as easily there.
We can write the table out as a CSV using write.table, telling it to separate values with commas (its default is spaces) and to leave out the row names.
write.table(defective, file = "defective.csv", sep = ",", row.names = FALSE)
Find the data here and rejoin me next time to see what it shows.
[top picture © Copyright Lewis Clarke and licensed for reuse under this Creative Commons Licence.]