Scraping in R: Access to mortgage petition

Over the past few years a good source of data has been Parliament’s petitions website.

Anyone can start petitions or sign them.

Petitions that reach 100,000 signatures must be considered for a debate in Parliament.

The most popular petitions often end up leading the news cycle, such as the petition arguing for a second EU referendum, the competing ones for and against Donald Trump to make a state visit to the UK and the one calling for the Meningitis B vaccine to be given to all babies.

On Monday 23rd October MPs debated whether paying rent should be proof enough to get a mortgage.

I wanted to find out where the most people had signed the petition around the UK.

We know where people signed because you have to enter a postcode when you sign, and each postcode is matched to a parliamentary constituency.

Previously I would have used OutWit Hub to scrape the data, but this time I used R.

What is scraping?

Scraping is the process of extracting data from web pages, such as an HTML table or JSON data, into a more machine-readable format such as CSV.

It works by selecting the information between particular tags in the code that makes up a web page.

For example, in HTML the <strong> tag is used to put text in bold.

So some HTML code might look like this:

<p>Here is some text in <strong>bold</strong>.</p>

Which would look like this:

Here is some text in bold.

If you set your scraper to select everything in <strong> tags, it would take the word above plus every other piece of bold content on this page.
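As a quick sketch of that idea (assuming you have the rvest package installed), you can parse the snippet above in memory and pull out the bold text:

```r
library(rvest)  # also provides read_html() and the %>% pipe

# Parse the HTML snippet from above as an in-memory document
snippet <- read_html('<p>Here is some text in <strong>bold</strong>.</p>')

# Select every <strong> element and extract its text
bold_text <- snippet %>% html_nodes("strong") %>% html_text()
bold_text
# [1] "bold"
```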

How to scrape in R

The main package for scraping in R is rvest. The first thing to do is to load up the relevant packages and bring in the URL we want using the read_html() function.

library(rvest)
library(tidyr)
library(dplyr)

page <- read_html("https://petition.parliament.uk/archived/petitions/186565.json")

Viewed in a browser, the page is one long block of text. JSON is designed to be read by computers rather than humans, hence the ugly layout.

As this is an introduction to scraping in R, we’ll stick to one page for now. The real magic comes in when you want to scrape multiple pages, which we’ll get to in a later post.

Next up you want to use html_nodes() to extract the data you want.


How do you know what data you want?

The easiest way to figure out what you want is to inspect the code behind the web page.

In Google Chrome, you can do this by right-clicking and then clicking ‘inspect’.

As I said, it’s a bit of an unusual page to begin with, so let’s try it briefly with a more typical HTML page.

Here is a BBC sport report on Huddersfield Town beating Manchester United.

Inspecting the element takes us here:

Our cursor is hovering over the byline section of the code (i.e. the bit that says ‘by Matt Davis’).

The HTML tags at the bottom of the screenshot indicate the path to this line of code within the HTML document.

This is how we can scrape his byline:

> report <- read_html("http://www.bbc.co.uk/sport/football/41618803")
> byline <- report %>% html_nodes('div.gel-flag__body') %>% html_text()
> byline
[1] " By Matt Davis BBC Sport "

You can see here that the div.gel-flag__body class is the one that contains his name.

This isn’t an exact science: sometimes you have to play around with the tags to get what you want.

The html_text() function can be swapped out for others as well such as html_attr(). We won’t go into this now – see here for more details.
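To give one small illustration (a made-up snippet, not from the BBC page): where html_text() returns the visible text of an element, html_attr() returns the value of one of its attributes.

```r
library(rvest)  # also provides the %>% pipe

# A made-up page containing a single link
link <- read_html('<p><a href="https://example.com">a link</a></p>') %>%
  html_nodes("a")

link_text <- html_text(link)         # the visible text of the link
link_href <- html_attr(link, "href") # the value of its href attribute
link_text
# [1] "a link"
link_href
# [1] "https://example.com"
```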

This code will work on other BBC Sport reports as well, such as this one from the Chelsea v Watford game earlier that day.

All you need to do is change the URL to be read:

report <- read_html("http://www.bbc.co.uk/sport/football/41618800")
byline <- report %>% html_nodes('div.gel-flag__body') %>% html_text()
> byline
[1] " By Matthew Henry BBC Sport "

Back to the petition data

Using html_nodes() again, this command will get us all the data we need. It's relatively simple because the page is JSON-formatted data.

results <- page %>% html_nodes('body') %>% html_text()

But there’s a catch…

It’s an enormous character vector with everything all jumbled up together.

Let’s begin to clean the data.

results_df <- data.frame(strsplit(results, '\\{"name":"'))
names(results_df) <- "names"

These commands bring the data into a data frame, splitting it at the name of each constituency.
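To see what strsplit() is doing, here it is on a made-up fragment in the same shape as the petition JSON (the seat names and counts are invented):

```r
# Hypothetical fragment shaped like the petition JSON
raw <- 'prefix{"name":"Seat A","signature_count":1},{"name":"Seat B","signature_count":2}'

# \\{ escapes the curly brace, which is a special character in regular expressions
parts <- strsplit(raw, '\\{"name":"')[[1]]
parts[2]
# [1] "Seat A\",\"signature_count\":1},"
```

Each element after the first now starts with a constituency name, which is what makes the clean-up below possible.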

A typical row now looks like this:

> results_df$names[100]
[1] Berwickshire, Roxburgh and Selkirk","ons_code":"S14000008","mp":"John Lamont MP","signature_count":101},

All we need to do now to clean the data is to split up this column into new columns, separating out the ONS code, the MP’s name and the signature count.

For this we can use tidyr’s separate() function.

results_df <- results_df %>% separate(names, into = c("names","code"), sep = '\\","ons_code":"')

This splits the names column into two, called names and code, wherever the ons_code tag appears in the JSON. The double backslash passes a literal backslash through to the regular expression, 'escaping' characters that would otherwise have a special meaning there (like the curly brace in the strsplit() call above).

Our row now looks like this:

> str(results_df[100,])
'data.frame': 1 obs. of  2 variables: 
$ names: chr "Berwickshire, Roxburgh and Selkirk" 
$ code : chr "S14000008\",\"mp\":\"John Lamont MP\",\"signature_count\":101},"

Repeat this two more times:

results_df <- results_df %>% separate(code, into = c("code","mp"), sep = '\\","mp":"')
results_df <- results_df %>% separate(mp, into = c("mp","signatures"), sep = '\\","signature_count":')
#change signature count from character to numeric format
results_df$signatures <- as.numeric(gsub("},","",results_df$signatures))

Our data is now cleaned up

We can see that Plymouth, Sutton and Devonport had the most signatures of any seat, with 1,204.

This is useful to know, but UK constituencies are different sizes. To get a truly representative picture we need the populations of the different seats.

This requires a bit of work to put together for the whole UK – the ONS version only includes England and Wales.

pop <- read.csv("constituencies_pop.csv",stringsAsFactors = FALSE)
results_merged <- merge(results_df, pop, by.x = "code", by.y = "PCON11CD")
results_merged$one_in_x <- results_merged$All.Ages / results_merged$signatures
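The merge() step works like a spreadsheet lookup, joining the two tables on their shared code columns. A toy sketch with invented codes and numbers (the real files use ONS codes and an All.Ages population column):

```r
# Made-up petition results and populations keyed on a shared code
petition <- data.frame(code = c("E1", "E2"),
                       signatures = c(100, 50),
                       stringsAsFactors = FALSE)
pop <- data.frame(PCON11CD = c("E1", "E2"),
                  All.Ages = c(80000, 100000),
                  stringsAsFactors = FALSE)

# Join the two tables where petition$code matches pop$PCON11CD
merged <- merge(petition, pop, by.x = "code", by.y = "PCON11CD")

# One signature per this many residents: a lower number means more support
merged$one_in_x <- merged$All.Ages / merged$signatures
merged$one_in_x
# [1]  800 2000
```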

Analysis

Here are the top 10 constituencies with proportionally the most signatures:

results_ordered <- results_merged[order(results_merged$one_in_x),]

> results_ordered$names[1:10]
 [1] "South West Devon" "Plymouth, Sutton and Devonport" "Plymouth, Moor View" 
 [4] "Torbay" "Newton Abbot" "South East Cornwall" 
 [7] "Hove" "Worthing West" "Totnes" 
[10] "East Worthing and Shoreham"
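The order() function returns the row positions that would sort the column, so indexing the data frame with it reorders every row at once. A minimal sketch with made-up values:

```r
# Invented seats and one_in_x ratios
df <- data.frame(names = c("A", "B", "C"),
                 one_in_x = c(310, 95, 180),
                 stringsAsFactors = FALSE)

# Sort ascending: the smallest one_in_x (strongest support) comes first
sorted <- df[order(df$one_in_x), ]
sorted$names
# [1] "B" "C" "A"
```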

Interestingly people in Devon seem very concerned about access to mortgages, something I flagged up in my story in Devon Live.

The top five constituencies with the highest proportions of signatures are all in Devon.

Further east along England’s south coast, Hove, Worthing West, East Worthing and Shoreham and Brighton Pavilion all had high proportions of signatories.

Conclusion

One caveat with the data is that you don’t have to prove that you actually live at the postcode you enter before signing the petition.

It’s also possible to sign the petition from abroad, although you have to be a British citizen or UK resident to do so.

Having said that, petitions are a very good source of data for journalists on topical political issues. There isn’t a whole lot of detailed local polling/survey data about topical issues, which makes the petitions data a valuable resource.

For those interested, here’s a transcript of the debate from Monday.

We’ll be talking more about scraping at a later date on more regular HTML pages.
