In the last post we looked at five common SQL commands and their R equivalents using 300 rows of Land Registry data.
To recap, we covered:
- Find the average price of sales
- Find the most expensive sales
- Look for towns containing ‘Bat’ in their name
- Find the average price by type of property
- Change a field
Here are five more simple tasks in SQL with the R equivalents:
Task 1: Filter by multiple values
SQL command:
SELECT * FROMTABLE 1
WHERElocal_authority
IN ("STOCKPORT", "MANCHESTER")
R command:
lr <- read.csv("lr.csv",header=FALSE) colNames <- c("id", "price", "date", "postcode", "type", "new_build", "hold", "house_name2", "house_name1", "street", "town", "town2", "local_authority", "county", "PPD_category_type", "record_status") names(lr) <- colNames LAs <- c("MANCHESTER","STOCKPORT") manc_stockport <- subset(lr, local_authority %in% LAs)
Task 2: Make a count of sales by street
SQL command:
SELECT COUNT(street) AS 'Street', street
FROM TABLE 1
GROUP BY street
R command:
by_street <- lr %>% group_by(street) %>% summarise(Count = n())
Task 3: Select distinct (unique) values in a dataset
SQL command:
SELECT DISTINCT type FROM TABLE 1
R command:
> unique(lr$type) [1] F T D S Levels: D F S T
Similar: perform a count of unique values
SQL command:
SELECT COUNT(DISTINCT type) FROM TABLE 1
R command:
length(unique(lr$type))
Task 4: Change blanks to NULL/NA values
In R there is a difference between a blank cell (“”) and a NA.
In SQL it’s the same with NULL values and blank cells.
Having NA values in R means you can use functions like is.na() or arguments such as na.rm. It also looks neater and avoids any hidden whitespace.
SQL command:
First, let’s note the difference in the number of rows. Many fields have a blank house name (“”) rather than NULL values
UPDATE TABLE 1
SET house_name=NULL WHERE house_name=""
After performing this command:
R command:
lr$house_name1 <- as.character(lr$house_name1) lr$house_name1[lr$house_name1 == ""] <- NA
Task 5: Remove two unneeded columns
We don’t really need the two columns furthest to the right (15 and 16) in our data, so let’s get rid of them. I didn’t name them in SQL so they are known as COL 15
and COL 16.
SQL command:
ALTER TABLETABLE 1
DROPCOL 15
, DROPCOL 16
R command:
lr <- lr[,1:14]
Conclusion:
Those were five more easy SQL commands and their R equivalents. In the future I intend to look at JOINs in SQL and how they relate to merge() in R.