At work I’ve been building a project called Find My Seat, which is now live. You type in your postcode and it presents you with lots of useful data about the economy, healthcare, immigration and Brexit in your area. I pulled together all the data used in the widget.
One task I had to figure out was: how do you work out the mean age of a constituency?
There are two ways: an easy way using spreadsheets and a hard way using for loops in R. I discovered the hard way before the easy way…
Using R
I used these for loops to work out the mean age of each Westminster parliamentary constituency in England and Wales.
First of all, let’s get the data from the ONS population estimates.
The ONS publishes the data as follows:
data <- read.csv("ageByConstituency.csv", stringsAsFactors = FALSE)
str(data)

'data.frame': 573 obs. of 94 variables:
 $ PCON11CD: chr "E14000530" "E14000531" "E14000532" "E14000533" ...
 $ PCON11NM: chr "Aldershot" "Aldridge-Brownhills" "Altrincham and Sale West" "Amber Valley" ...
 $ All.Ages: int 105653 77670 99970 88985 98989 104653 119036 91374 117278 120809 ...
 $ X0      : int 1509 814 1079 916 848 1164 1496 1219 1682 1523 ...
 $ X1      : int 1445 780 1195 883 828 1248 1429 1228 1603 1450 ...
 $ X2      : int 1460 806 1166 984 896 1261 1578 1259 1725 1539 ...
 $ X3      : int 1457 788 1346 960 1016 1326 1611 1288 1766 1566 ...
 $ X4      : int 1408 821 1381 999 1069 1361 1573 1252 1795 1722 ...
 $ X5      : int 1404 847 1331 1041 1104 1266 1558 1258 1679 1605 ...
 $ X6      : int 1323 913 1392 884 1086 1261 1621 1119 1625 1638 ...
‘X0’ here means ‘aged 0’, in other words babies who have not yet reached their first birthday. ‘X1’ means between one and two, ‘X2’ between two and three and so on.
So we can see that there are 105,653 people in Aldershot, of whom 1,509 have not yet reached their first birthday, 1,445 are between one and two and so on.
Let’s create a new data frame called mean_ages. To do that we’ll take the ONS codes ($PCON11CD) and constituency names ($PCON11NM) and leave the rest. We’ll add in a blank column where the mean values will go. We’ll also create a blank data frame called all_ages.
mean_ages <- data[, 1:2]
mean_ages$mean <- NA
all_ages <- data.frame()
str(mean_ages)

'data.frame': 573 obs. of 3 variables:
 $ PCON11CD: chr "E14000530" "E14000531" "E14000532" "E14000533" ...
 $ PCON11NM: chr "Aldershot" "Aldridge-Brownhills" "Altrincham and Sale West" "Amber Valley" ...
 $ mean    : logi NA NA NA NA NA NA ...
Next up is quite a complicated for loop inside another for loop:
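# Loop over every constituency (one row of data each)
constituency = 0
for (i in 1:573) {
  constituency = constituency + 1
  age = -1
  year = 3
  # Empty all_ages for each constituency; otherwise rows from earlier
  # constituencies would stay in it and skew every mean after the first
  all_ages <- data.frame()
  # Loop over the 91 age columns (ages 0 to 90)
  for (j in 1:91) {
    age = age + 1
    year = year + 1
    # One row per person of this age
    n <- data[constituency, year]
    year_df <- data.frame(age = rep(age, n), number = seq_len(n))
    all_ages <- rbind(all_ages, year_df)
  }
  mean_ages[constituency, 3] <- mean(all_ages$age)
}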
Explanation
The first for loop runs through all 573 constituencies in the data. The constituency counter starts at 0 outside the loop and has 1 added at the top of each iteration, so it points at row 1, then row 2, and so on. We start our age and year variables one behind where they need to be because the second loop adds one to each of them every time it runs. In other words, on the first pass of the second loop age moves from -1 to 0 (the age of the youngest babies) and year moves from three to four (the fourth column is the first column of population data by age).
The second for loop runs 91 times inside each of the 573 iterations of the first. It goes through all the possible ages from 0 to 90, where 90 covers everyone aged 90 and over.
For each age it creates a data frame called ‘year_df’ with as many rows as there are people of that age. It then fills in a sequence of numbers from one to the total number of rows for that year group. So for example, for Aldershot the first year_df will be 1,509 rows long. It will have two columns – one saying ‘0’ some 1,509 times and the other a sequence of numbers from 1 to 1,509.
The all_ages data frame uses rbind to stack each year_df one after the other, so that by the end for Aldershot we will have a dataset 105,653 rows long, correctly divided up in proportion to the age profile of the town.
Finally we take a mean of the age from all_ages and add that to our prized data frame of mean_ages.
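To see what that expansion amounts to, here is a minimal sketch on a toy example. The counts are made up for illustration and are not ONS figures; rep() stands in for the stacked year_df rows by producing one entry per person.

```r
# Made-up counts for illustration: 3 people aged 0, 2 aged 1, 1 aged 2
ages <- 0:2
counts <- c(3, 2, 1)

# One entry per person, which is what the stacked year_df rows amount to
expanded <- rep(ages, times = counts)   # 0 0 0 1 1 2

# The mean of the expanded vector is the mean age of the toy population
toy_mean <- mean(expanded)
```

With six people whose ages add up to four years, toy_mean comes out at 4/6, or about 0.67.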
Results
The mean age of Aldershot is 38.1.
The youngest constituency in England and Wales is Birmingham Ladywood, where the average is 30.3 years old.
The oldest is Christchurch in Dorset, where the average is 49.2 years old.
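One way to pull out the youngest and oldest rows is with which.min and which.max. This is only a sketch: the small results data frame below is a stand-in built from the three means quoted above, not the full 573-row mean_ages table.

```r
# Three rows using the means quoted above; the real mean_ages has 573 rows
results <- data.frame(
  PCON11NM = c("Aldershot", "Birmingham Ladywood", "Christchurch"),
  mean = c(38.1, 30.3, 49.2),
  stringsAsFactors = FALSE
)

# Rows with the lowest and highest mean age
youngest <- results[which.min(results$mean), ]
oldest   <- results[which.max(results$mean), ]

# Or sort the whole table from youngest to oldest
ranked <- results[order(results$mean), ]
```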
A simpler way using spreadsheets
I’m not a fan of using R for R’s sake. If there’s a simpler way, use that. And there is:
In the spreadsheet, simply multiply each age by the number of people of that age in the constituency, add up the results, then divide by the total number of people in the constituency.
So for Aldershot, the total ages of everybody there added up to 4,023,604. Everyone in Aldershot combined is four million years old.
Divided by 105,653 makes an average of 38.1.
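The same arithmetic can be done in R without any loops. The sketch below is one possible way, not the method used for the widget: mean_age_by_row is a hypothetical helper, and it assumes the age columns start at column 4 and run in order from age 0 to the last column, as in the ONS file. The sample rows hold only the first seven age columns shown in the str() output earlier, so the result is the mean age of the under-sevens, not of the whole constituency.

```r
# Hypothetical helper: mean age for every row of a table laid out like the ONS file.
# Assumes the single-year age counts start at column `first_age_col` and run,
# in order from age 0, to the last column.
mean_age_by_row <- function(tbl, first_age_col = 4) {
  counts <- as.matrix(tbl[, first_age_col:ncol(tbl), drop = FALSE])
  ages <- seq_len(ncol(counts)) - 1           # 0, 1, 2, ...
  total_years <- as.vector(counts %*% ages)   # sum of age * people, per row
  total_years / rowSums(counts)               # divide by people counted
}

# First seven age columns for the first two constituencies, from the str() output
sample_rows <- data.frame(
  PCON11CD = c("E14000530", "E14000531"),
  PCON11NM = c("Aldershot", "Aldridge-Brownhills"),
  All.Ages = c(105653, 77670),
  X0 = c(1509, 814), X1 = c(1445, 780), X2 = c(1460, 806), X3 = c(1457, 788),
  X4 = c(1408, 821), X5 = c(1404, 847), X6 = c(1323, 913)
)

under7_means <- mean_age_by_row(sample_rows)
```

Run against the full 94-column table, the same call would return one mean per constituency, with everyone aged 90 and over counted as 90, as in the for loop version.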
In a roundabout way, R is doing exactly this with this for loop.
I showed you this R working because it was good practice with for loops for me. You’ll see more for loops in future.