Recently the British Department for Transport published its latest STATS19 data for the year 2016.
We’ve looked at this data before.
To recap, each row of the STATS19 data is a traffic accident that caused injury or death, identified by a unique Accident Index.
It’s an extremely detailed dataset containing fields such as latitude and longitude coordinates, the number of vehicles involved, and light and weather conditions.
There are two other linked datasets providing more detail on the casualties and vehicles.
I decided to investigate which month was the worst for accidents. My hunch was that it might have been November with the clocks going back and worsening weather.
The first step is to get all the data into one data frame.
library(dplyr)
library(tidyr)

accidents2016 <- read.csv("Accidents_2016.csv")
accidents2015 <- read.csv("Accidents_2015.csv")
accidents2014 <- read.csv("DfTRoadSafety_Accidents_2014.csv")
accidents2013 <- read.csv("DfTRoadSafety_Accidents_2013.csv")
accidents2012 <- read.csv("DfTRoadSafety_Accidents_2012.csv")
accidents2011 <- read.csv("DfTRoadSafety_Accidents_2011.csv")
accidents2010 <- read.csv("DfTRoadSafety_Accidents_2010.csv")

# Make sure all the column names are the same or the rbind won't work
names(accidents2014)[1] <- "Accident_Index"

# Combine them all together
accidents <- rbind(accidents2010, accidents2011, accidents2012, accidents2013,
                   accidents2014, accidents2015, accidents2016)
The STATS19 data works using codes
> str(accidents)
'data.frame':	1013118 obs. of 32 variables:
 $ Accident_Index        : Factor w/ 1013118 levels "201001BS70003",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Location_Easting_OSGR : int  527580 525340 524800 525080 526760 528020 526360 523770 524300 526760 ...
 $ Location_Northing_OSGR: int  177730 180470 180300 180920 177740 177780 177420 181000 180070 177740 ...
 $ Longitude             : num  -0.164 -0.195 -0.203 -0.199 -0.176 ...
 $ Latitude              : num  51.5 51.5 51.5 51.5 51.5 ...
 $ Police_Force          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Accident_Severity     : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Number_of_Vehicles    : int  2 1 2 2 2 2 2 2 2 1 ...
 $ Number_of_Casualties  : int  1 1 1 1 1 1 1 1 1 1 ...
 # etc.
As you can see from this part of the data’s structure, fields such as Police_Force and Accident_Severity are stored as numeric codes rather than descriptive strings.
For example, an Accident Severity of 1 would indicate that it was fatal for at least one person.
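The severity codes can be turned into readable labels with factor() — a minimal sketch using the published coding of 1 = Fatal, 2 = Serious, 3 = Slight (the toy vector here stands in for the real column):

```r
# STATS19 Accident_Severity coding: 1 = Fatal, 2 = Serious, 3 = Slight
severity_codes <- c(3, 3, 1, 2)  # toy values standing in for accidents$Accident_Severity
severity <- factor(severity_codes, levels = c(1, 2, 3),
                   labels = c("Fatal", "Serious", "Slight"))
as.character(severity)
```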
The DfT also publishes a guide to the data with a spreadsheet key.
We can read in sheets of this spreadsheet to help decode the data. As always, I’m interested in how the data varies across Britain, so I will decode the Local Authority.
LA_lookup <- read.csv("lookup_LAD.csv")

# Merge the local authority lookup with the accidents data
accidents_LA <- merge(accidents, LA_lookup,
                      by.x = "Local_Authority_.District.", by.y = "code")

# Check to see whether the local authorities are in
> head(summary(accidents_LA$label))
  Birmingham        Leeds  Westminster     Cornwall     Bradford Glasgow City
       18430        13449        10653         9126         8844         8720
To work out which month is the worst, we will need the dates to be in date format:
> str(accidents_LA$Date)
 Factor w/ 2557 levels "01/01/2010","01/02/2010",..: 918 220 1689 136 28 1197 315 1768 298 354 ...
We’ll use the as.Date() function that we’ve encountered before to change the dates into date format:
accidents_LA$Date <- as.Date(accidents_LA$Date, format = "%d/%m/%Y")

> str(accidents_LA$Date)
 Date[1:1013118], format: "2012-08-16" "2010-04-19" "2014-12-19" "2010-04-12" "2010-04-03" "2013-05-09" ...
We can tinker with the dates to create month/year dates (e.g. November 2016) and just months (e.g. April):
accidents_LA$Month_Yr <- format(as.Date(accidents_LA$Date), "%Y-%m")
accidents_LA$Month <- format(as.Date(accidents_LA$Date), "%m")

> head(accidents_LA$Month_Yr)
[1] "2012-08" "2010-04" "2014-12" "2010-04" "2010-04" "2013-05"
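Incidentally, keeping %m zero-padded means the month strings sort in calendar order even though they are text, which matters when we group and arrange by them:

```r
# Zero-padded month strings sort correctly as text...
padded <- format(as.Date(c("2016-02-01", "2016-11-01")), "%m")
sort(padded)            # "02" "11"
# ...whereas unpadded month numbers stored as strings would not
sort(c("2", "11"))      # "11" "2"
```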
First of all, let’s work out which has been the worst month overall:
byDate <- accidents_LA %>%
  group_by(Month) %>%
  summarise(accidents = n())

# Descending order
byDate <- arrange(byDate, desc(accidents))

> byDate
# A tibble: 12 x 2
   Month accidents
   <chr>     <int>
 1 11        92579
 2 10        92102
 3 07        90049
 4 09        87925
 5 06        85641
 6 05        84957
 7 08        83749
 8 03        81742
 9 12        81514
10 01        79835
11 04        77846
12 02        75179
We can see that between 2010 and 2016, November (month 11) was the worst month for accidents, with 92,579.
This is 477 more than October, despite October having 31 days to November’s 30.
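Adjusting for month length widens November’s lead further. A quick back-of-the-envelope check using the totals above (seven years of data, so roughly 210 November days against 217 October days):

```r
# Per-day accident rates from the 2010-2016 totals in the table above
nov_total <- 92579
oct_total <- 92102
nov_per_day <- nov_total / (30 * 7)  # 7 Novembers of 30 days each
oct_per_day <- oct_total / (31 * 7)  # 7 Octobers of 31 days each
round(c(November = nov_per_day, October = oct_per_day), 1)
```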
Now, let’s see which local authorities also have November as the worst month:
Given that November is the worst month overall, we would expect a majority or plurality of local authorities to be the same, assuming there are no massive outliers skewing the data.
To work this out, we’re going to use a combination of the table() function and tidyr.
byMonth_LA <- data.frame(table(accidents_LA$label, accidents_LA$Month))

Currently our data frame looks like this:

> head(byMonth_LA)
           Var1 Var2 Freq
1 Aberdeen City   01  179
2 Aberdeenshire   01  270
3          Adur   01   85
4     Allerdale   01  132
5       Alnwick   01    0
6  Amber Valley   01  158
Effectively each local authority has 12 rows of data – one per month. It would be much easier if these months were displayed horizontally, rather than vertically.
Happily tidyr’s spread() function deals with that:
byMonth_LA <- byMonth_LA %>% spread(key = Var2, value = Freq)

> head(byMonth_LA)
           Var1  01  02  03  04  05  06  07  08  09  10  11  12
1 Aberdeen City 179 165 147 171 192 160 181 183 187 180 189 174
2 Aberdeenshire 270 294 240 220 268 269 258 318 300 277 247 235
3          Adur  85  75  81  85  78  93  92  98  85  86  82  89
4     Allerdale 132 146 129 129 134 142 159 139 130 148 147 148
5       Alnwick   0   0   0   0   0   0   0   0   0   0   0   0
6  Amber Valley 158 124 143 138 131 131 152 157 142 183 162 147
Much easier to understand
Now we can find the highest value for the year (thanks to Stack Overflow for some help on this bit):
# Row-wise maximum across the 12 month columns (columns 2 to 13)
byMonth_LA$max <- apply(byMonth_LA[, 2:13], 1, max)
And finally, for now we can just work out whether November is the highest or not:
# Column "11" is November; flag rows where it matches the row maximum
byMonth_LA$max_month <- ifelse(byMonth_LA$max == byMonth_LA[["11"]], "November", "other")
One word of caution for the above ifelse() formula – it won’t factor in any joint results. So if November and October are both first for a particular area with 100 accidents each, then it will just say ‘November’ and not indicate it was a joint result.
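One hedged sketch of a tie-aware version, using toy data (the two-month frame and its values are illustrative; the real table has all 12 month columns). It counts how many months share the row maximum and flags joint results:

```r
# Toy month table: one LA with a clean November peak, one where October and November tie
byMonth <- data.frame(`10` = c(160, 100), `11` = c(186, 100), check.names = FALSE)
byMonth$max <- apply(byMonth, 1, max)

# How many month columns equal the row maximum?
n_at_max <- rowSums(byMonth[, c("10", "11")] == byMonth$max)

byMonth$max_month <- ifelse(byMonth[["11"]] < byMonth$max, "other",
                            ifelse(n_at_max > 1, "November (joint)", "November"))
```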
> byMonth_LA[10:20,]
                    Var1  01  02  03  04  05  06  07  08  09  10  11  12 max max_month
10              Ashfield 186 143 160 139 164 185 177 163 185 160 186 163 186  November
11               Ashford 203 176 191 177 205 231 259 226 241 192 207 202 259     other
12        Aylesbury Vale 204 192 243 179 230 205 229 224 211 227 252 226 252  November
13               Babergh 115 125 120 138 143 143 134 130 161 144 171 135 171  November
14  Barking and Dagenham 269 234 248 263 277 261 274 274 293 308 333 261 333  November
15                Barnet 613 570 615 578 619 677 583 530 584 650 721 599 721  November
16              Barnsley 301 275 298 293 305 295 327 305 307 313 361 334 361  November
17     Barrow-in-Furness  76  69  63  67  83  84  66  78  97  73  99  83  99  November
18              Basildon 218 213 219 205 235 257 251 195 248 206 259 210 259     other
19 Basingstoke and Deane 195 178 174 179 203 209 204 211 196 210 194 207 211     other
20             Bassetlaw 201 175 168 157 154 155 176 173 157 186 168 166 201     other
It looks like many local authorities do indeed have November as the worst month.
Working out which month and year combination was the worst for each local authority (LA) is a similar task:
months <- accidents_LA %>%
  group_by(Month_Yr, label) %>%
  summarise(freq = n())

months <- months %>% spread(key = Month_Yr, value = freq)
As is working out which was the worst day of all for each LA:
alldays <- accidents_LA %>%
  group_by(Date, label) %>%
  summarise(freq = n())
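A hypothetical next step, not taken in the post itself, is to pull out the single worst day per LA directly in R rather than in a spreadsheet. A base-R sketch with toy data standing in for the grouped counts above:

```r
# Toy stand-in for the alldays frame produced by the grouped count
alldays <- data.frame(
  Date  = as.Date(c("2016-11-25", "2016-01-02", "2016-11-25", "2016-06-01")),
  label = c("Barnet", "Barnet", "Leeds", "Leeds"),
  freq  = c(12, 5, 20, 7)
)

# For each LA, keep the row with the highest daily count
worst_rows <- do.call(rbind, lapply(split(alldays, alldays$label),
                                    function(d) d[which.max(d$freq), ]))
```

Note that which.max() returns only the first maximum, so the same joint-result caveat as above applies here too.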
I printed off these spreadsheets to work with them afterwards.
Here’s the story that came out of it in the Birmingham Mail.
The RAC’s Rod Dennis had this to say about why November was the most dangerous month:
“November marks the beginning of shorter days with much less daylight and often a deterioration in the weather, so it perhaps makes sense that it is also a month when accidents peak – although there are large regional variations.

“It is also the first month in the year when road users are beginning to get used to there being less light – unlike at the end of the winter when they will have become accustomed to both dark mornings and afternoons.

“So it makes sense that drivers do everything they can to be prepared for darker driving conditions, starting with ensuring all their exterior lights are in working order, together with things like their vehicles’ tyres which are vital in maintaining grip in wetter and icier conditions.”
Conclusion
According to the Department for Transport, August is the busiest month for traffic overall; November, however, is the busiest on urban roads.
A further investigation would be to combine this data with traffic-flow figures, but that’s beyond the scope of this post.
This will just be the start of my STATS19 analysis of 2016 in R.
It is one of the richest datasets the Government publishes, so I will be revisiting it often!
Picture credit: © Copyright Robin Webster and licensed for reuse under this Creative Commons Licence.