Back in August 2014, around the 100th anniversary of the outbreak of the First World War, the Data Unit published our analysis of the Commonwealth War Graves Commission’s records of fallen soldiers, airmen, sailors and other servicemen and women who gave their lives during the next four years.
Today is Armistice Day 2017, which means that in one year’s time we will reach the 100th anniversary of the end of the war.
But many more young men would be killed before peace was finally declared.
From Tableau to R
When I first started analysing the World War I data in 2014 I used Tableau.
Tableau is a good tool for analysing large datasets: it copes with large volumes of data and, unlike R, doesn’t have a steep learning curve.
Now, however, I use R to analyse the data.
library(dplyr)

# Load the CSV; header = FALSE indicates it has no header row
war <- read.csv("casualties_ww1_SUPER_NEW.csv", header = FALSE)

# Remove the unnecessary timestamp (fixed = TRUE treats the dots literally)
war$V8 <- gsub(" 00:00:00.000", "", war$V8, fixed = TRUE)

# Keep deaths from 1917-11-11 to 1918-11-11 inclusive;
# ISO-format dates compare correctly as text
lastyear <- war[war$V8 >= "1917-11-11", ]
lastyear <- lastyear[lastyear$V8 <= "1918-11-11", ]
To make the stories easier to understand, I selected only those who died between November 11, 1917 and November 11, 1918.
Of course, the declaration of peace did not heal the wounds of those already injured, and more soldiers went on to die after November 11, 1918.
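The date window described above relies on comparing dates as text, which works because they are in year-month-day order. As a minimal sketch of an alternative, the same selection can be made with real Date values. The data frame below is an invented stand-in for the real file, and the names `war_demo`, `died` and `lastyear_demo` are hypothetical; only the `V8` column name and the date format come from the code above.

```r
# Invented toy rows standing in for the real CSV; V8 mimics the raw date column
war_demo <- data.frame(
  V1 = c("A", "B", "C", "D"),
  V8 = c("1917-11-10 00:00:00.000", "1917-11-11 00:00:00.000",
         "1918-11-11 00:00:00.000", "1918-11-12 00:00:00.000"),
  stringsAsFactors = FALSE
)

# Take the first 10 characters (YYYY-MM-DD) and parse them as a Date
war_demo$died <- as.Date(substr(war_demo$V8, 1, 10), format = "%Y-%m-%d")

start <- as.Date("1917-11-11")
end <- as.Date("1918-11-11")

# which() silently drops rows whose date is missing or could not be parsed
keep <- which(war_demo$died >= start & war_demo$died <= end)
lastyear_demo <- war_demo[keep, ]
```

Working with Date values also makes it easier to count deaths by month or week later on, though the text comparison is perfectly adequate for a simple cut-off.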
The next step is to add in the ranks spreadsheet to match the numbered rank codes to their ranks.
ranks <- read.csv("Ranks.csv")
# all.x = TRUE keeps every casualty, even if its rank code has no match
lastyear_ranks <- merge(lastyear, ranks, by.x = "V3", by.y = "rank", all.x = TRUE)
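To show what the merge does with rank codes that are missing from the lookup, here is a small sketch on invented data. The column `rank_name` and all of the values are assumptions made for illustration; the real Ranks.csv may be laid out differently.

```r
# Invented casualties: V3 holds the numeric rank code, as in the real data
casualties_demo <- data.frame(
  V1 = c("Smith", "Jones", "Evans"),
  V3 = c(1, 2, 99),
  stringsAsFactors = FALSE
)

# Invented lookup table; 'rank_name' is a hypothetical column name
ranks_demo <- data.frame(
  rank = c(1, 2),
  rank_name = c("Private", "Corporal"),
  stringsAsFactors = FALSE
)

# Left join: every casualty is kept, unmatched codes get NA for rank_name
merged_demo <- merge(casualties_demo, ranks_demo,
                     by.x = "V3", by.y = "rank", all.x = TRUE)

# Rows whose rank code was not found in the lookup
unmatched <- merged_demo[is.na(merged_demo$rank_name), ]
```

Checking `unmatched` after the merge is a quick way to spot rank codes that still need adding to the lookup spreadsheet.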
Most of the biographical details are stored as free text in one field.
To find the names of soldiers from a particular area, we have to search in this field.
For Bath, for example:
bath <- dplyr::filter(lastyear_ranks, grepl("Bath",V11))
This filter does a ‘contains’ search, picking up all cases where this word appears anywhere in the field.
The problem with this kind of search is that it includes all (case-sensitive) mentions of the word Bath.
That will pick up Bath Road, Bath Street and Bath Cottages, plus any longer word that contains Bath, such as the Somerset village of Bathealton.
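The behaviour described above can be seen on a few invented strings. This sketch compares the plain ‘contains’ search with a whole-word search using the `\b` word-boundary marker; the `details` vector is made up for illustration and is not taken from the real records.

```r
# Invented examples of the free-text field
details <- c(
  "Son of John Smith, of Bath.",
  "Of 12, Bath Road, Bristol.",
  "Native of Bathealton, Somerset.",
  "Born at bath."
)

# Plain 'contains' search: case-sensitive, matches inside longer words too
contains <- grepl("Bath", details)

# Whole-word search: \\b marks a word boundary, so Bathealton is excluded
whole_word <- grepl("\\bBath\\b", details, perl = TRUE)
```

The whole-word pattern removes Bathealton but still matches Bath Road, so it narrows the list without replacing a manual check.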
The only way to clean the data is to go through it as thoroughly as I can, removing any cases where these crop up.
You can either do this in R or you can export the data to a spreadsheet and go through it in OpenOffice. I prefer doing this step in OpenOffice.
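If you want R to do some of the groundwork before the manual check, one possible approach is to flag the obvious street-name matches and then write the table out as a CSV to open in OpenOffice. This is only a sketch on invented rows: the `street_hit` column, the file name and the list of street words are all assumptions, and only the `V11` column name comes from the code above.

```r
# Invented rows; V11 stands in for the free-text details field
review <- data.frame(
  V1 = c("Smith", "Jones", "Evans"),
  V11 = c("Son of John Smith, of Bath.",
          "Of 12, Bath Road, Bristol.",
          "Husband of Mary Jones, of 3, Bath Cottages, Frome."),
  stringsAsFactors = FALSE
)

# Flag rows where Bath is followed by a street-type word (list is illustrative)
review$street_hit <- grepl("\\bBath (Road|Street|Cottages)\\b",
                           review$V11, perl = TRUE)

# Write to a temporary file here; in practice choose your own path
out_file <- file.path(tempdir(), "bath_for_review.csv")
write.csv(review, out_file, row.names = FALSE)
```

Sorting the spreadsheet on the flag column puts the likely false positives together, which makes the manual pass quicker. Flagged rows still need a human look, since an address such as "Bath Road, Bath" is a genuine match.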
Here is the finished result, as published by the Bath Chronicle.