I’ve been taking a look at SQL lately and how it relates to R.
SQL is a programming language to manipulate data in databases. Its syntax is different from R. It resembles standard English with clauses more so than R does.
In this post I’m going to look at some simple SQL commands and their R equivalents.
In order to follow along to this post, you’ll need access to some kind of tool to manage an SQL database, such as PHPMyAdmin. I haven’t found one for free yet (if you know of one let me know).
The dataset we’ll use is from the Land Registry
I downloaded the latest month from the Land Registry’s Price Paid data. I took the first 300 rows as my sample data and saved it. I’m using PHPMyAdmin for the SQL side.
Step 1: upload the data
SQL command: Click “Import” and upload your CSV file. It automatically names the table as ‘Table 1’
R command:
setwd("Your directory") lr <- read.csv("your_csv.csv", header = FALSE)
Step 2: Name the columns
SQL command: Structure -> click Change next to the column you want to change, then type in your new name in Name
Alternatively, using SQL:
ALTER TABLETABLE 1
CHANGECOL 2 price
VARCHAR(7) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL;
R command:
#headers from here https://www.gov.uk/guidance/about-the-price-paid-data#explanations-of-column-headers-in-the-ppd colNames <- c("id", "price", "date", "postcode", "type", "new_build", "hold", "house_name2", "house_name1", "street", "town", "town2", "local_authority", "county", "PPD_category_type", "record_status") names(lr) <- colNames
Task 1: Find the average price of all sales in the data
SQL command:
SELECT AVG(price) FROM TABLE 1
The structure of the command is as follows:
SELECT FUNCTION(relevant_column) FROM your_table
We also use back ticks (“) for any names that have a space between them. WordPress converts the back ticks into TABLE 1.
Slightly more advanced, using AS renames your temporary average column as ‘Average price’:
SELECT AVG(price) ASAverage price
FROMTABLE 1
R command:
> mean(lr$price) [1] 191912.3
Task 2: Find the most expensive sales
SQL command:
SELECT * FROM TABLE 1
ORDER BY price DESC LIMIT 5
Make sure your price column is set to INT (integer). You can do this by going to Structure -> Change -> Type.
The * symbol means all.
The LIMIT 5 command limits the query results to five records. This is useful to speed up queries and ensure you don’t crash.
The default in ORDER BY is ascending so we need to add DESC.
R command:
library(dplyr) lr <- arrange(lr, desc(price)) lr[1:5,]
Task 3: Look for towns containing ‘Bat’ in their name
SQL command:
SELECT * FROM TABLE 1
WHERE town LIKE 'BAT%'
The LIKE condition serves as the filter. The % sign is a wildcard meaning ‘anything’, which would filter towns like Bath, Batheaston or anything else beginning Bat.
R command:
library(dplyr) filter(lr, grepl('BAT', town))
Task 4: Find the average price by type of property
SQL command:
SELECT AVG(price) AS `Average price`, type FROM `TABLE 1` GROUP BY type
We are selecting the average price (renaming it using AS Average price
) and type. We need type as well otherwise it will just give us the averages with no labels. The GROUP BY condition provides the pivot table type function.
R command:
library(dplyr) by_type <- lr %>% group_by(type) %>% summarise(Average_price = mean(price))
Task 5: Change a field
SQL command:
UPDATE TABLE 1
SET type = "DETACHED" WHERE type = "D"
R command:
lr$type <- gsub("D","DETACHED", lr$type)
Conclusion
In the future I’ll post some more SQL/R crossover examples.
Perhaps it’s the hosting package I have but I found it difficult for the phpMyAdmin to hold even one full month of Land Registry data (about 90,000 records of 16 fields). Is that unusual?
Finally, almost everything I do now in R uses the dplyr package. I now inhabit the tidyverse.