SQL and R: Joins and Merges

Next up in our mini-series on SQL and R are the JOIN family of functions.

They work in the same way as the merge() function in R. To recap the basic syntax of merge():

merged_df <- merge(df1, df2, by.x = "code1", by.y = "code2")

Merge works using a column that is common to both data frames. Usually this will be some kind of ID field where each value is unique. If the common field has the same name in each data frame you can use the by argument, otherwise you’ll need by.x and by.y.

Inner join

An inner join takes only rows with a match in both tables. If your common field is an ID, only records with the same ID in BOTH tables will be included.

In the following examples, I took two data sets from the 2011 Census. In one of them I deleted all but 17 rows of the data.

You will find all 17 of the IDs of the second data set in the first data set because every local authority level Census spreadsheet has all the local authorities in England and Wales in it.

This means that our inner join will be 17 rows long. All 17 records in data set 2 are in data set 1, but only those 17 from data set 1 are in data set 2.

SELECT *
FROM TABLE 2
INNER JOIN TABLE 3 ON TABLE 2.geography code = TABLE 3.geography code

Here is a screenshot of the output:

R equivalent

bulk <- read.csv("bulk.csv")
bulk1 <- read.csv("bulk (1).csv")

bulk_merged <- merge(bulk1, bulk, by = "geography.code")

Left join

A LEFT JOIN takes all the values in the first table plus the records that match those in the second table.

In our example, if there are values in TABLE 2 not in TABLE 3, then they will still be included but there will be NULL values in the right hand side of the join, as seen in the example.

SELECT * 
FROM TABLE 2 
LEFT JOIN TABLE 3 ON TABLE 2.geography code = TABLE 3.geography code

R equivalent

The R equivalent makes use of the all.x argument. Set that to true and it will include all records in the first data set, match or not.

bulk_merged <- merge(bulk, bulk1, by = "geography.code", all.x = TRUE)

Right join

The RIGHT JOIN is the same, except all the values in the second table (TABLE 3 in our example) are kept regardless of a match.

In our example, as all the values in TABLE 3 are in TABLE 2, this one gives the same effect as the INNER JOIN.

SELECT *
FROM TABLE 2 
RIGHT JOIN TABLE 3 ON TABLE 2.geography code = TABLE 3.geography code

R equivalent

The same result can be achieved by using all.y = TRUE.

bulk_merged <- merge(bulk, bulk1, by = "geography.code", all.y = TRUE)

Conclusion

For more information, W3 Schools pages on joins have useful Venn diagrams showing which records will be included in an inner, left and right join.

Next up in our mini-series on SQL and R are the JOIN family of functions.

Inner join

R equivalent

Left join

R equivalent

Right join

R equivalent

Conclusion

Share this:

Related