At the end of the last post, we had our data in a file named defective.
The data contains all the injuries caused by car accidents where a road sign, signal or marking was obscured or not working properly from 2010 to 2015 (4,672 in total).
To R or not to R?
We could continue to analyse the data in R but I decided to print it off using write.table and analyse it in Google Sheets instead.
This website is called R for Journalists, but I don’t have to use R just for the sake of it. I don’t think doing this would really have been any faster in R, so I decided not to use it this time.
A few more vlookups needed
If you remember from last time, the DfT uses numbers to replace names and phrases in the data. They publish a lookup sheet that we’ll use to add in the two key data points we are missing:
- The name of the local authority in which the accident took place
- The situation with the signs or markings (the special conditions at site)
For those of you unfamiliar with how to do vlookups in Google Sheets, this is how we do it:
- Get the cheat sheet [will download on click]
- Find and copy the Local Authority District (tab 6)
- Add this in to our data as a new tab
- Do the same with Special Conditions at Site
- Perform the vlookups
The basic formula is this:
=VLOOKUP(M2,'Local Authority District'!A:B,2,FALSE)
This says: ‘take the value in M2 and look for it in Local Authority District. If you find it, give me the value in the second column next to it’.
Calling a pivot table on the data gives us our top ten local authorities with the most accidents where traffic lights, signs or markings were found to be defective or obscured:
And here are the top 10 by year:
Glasgow is top with 126 injuries in these kinds of accidents, ahead of Manchester in second with 92 and Liverpool in third with 89.
Oldham had a very good year in 2015 – there wasn’t a single injury from an accident with dodgy road signs or markings recorded in 2015 after 20 such injuries the previous year.
Things got significantly worse, according to the data, in Bradford and Glasgow.
Remember this doesn’t mean the signs or markings caused the accidents; just that they were recorded as faulty by the police.
These accidents are also the ones recorded by police – if you don’t call the cops, it doesn’t go in the data.
This data doesn’t tell the whole story though
You might be wondering, for instance, why London doesn’t appear in the top 10. Surely its huge size would put it number one?
London is not there because the city is split into its 32 boroughs plus the City of London.
If we were to add the totals from all 33 subdivisions of London, it would give us 477 which would indeed put London top.
There are two main caveats with the data:
- The data doesn’t account for the size of the road network
- It doesn’t account for how much traffic there is along the roads
We could make adjustments for these, but I didn’t in the stories I wrote based on this data.
The most newsworthy fact here is that these kinds of accidents and injuries are happening, not the rate at which they happen.
Sometimes you can try and cram in too much data into a story in order to explain and account for absolutely everything.
The resulting story will then be very ‘data-heavy’ and will likely bore our general interest readers.
In these situations, it’s often enough to step back, remind people of the pitfalls and caveats and let them reach their own conclusions.
Concluding thoughts:
R was very useful for handling the large datasets, merging them together and filtering the specific data we were interested in.
From here, we had a choice about whether to continue with R or whether to use an ordinary spreadsheet program. There wasn’t much difference between the two, but I chose the latter. You don’t have to use R all the time!
Picture is from Oast House Archive, reused under Creative Commons