Next up in our mini-series on SQL and R are the JOIN family of functions.
They work in the same way as the merge() function in R. To recap the basic syntax of merge():
merged_df <- merge(df1, df2, by.x = "code1", by.y = "code2")
Merge works using a column that is common to both data frames. Usually this will be some kind of ID field where each value is unique. If the common field has the same name in each data frame you can use the by argument, otherwise you’ll need by.x and by.y.
Inner join
An inner join takes only rows with a match in both tables. If your common field is an ID, only records with the same ID in BOTH tables will be included.
In the following examples, I took two data sets from the 2011 Census. In one of them I deleted all but 17 rows of the data.
You will find all 17 of the IDs of the second data set in the first data set because every local authority level Census spreadsheet has all the local authorities in England and Wales in it.
This means that our inner join will be 17 rows long. All 17 records in data set 2 are in data set 1, but only those 17 from data set 1 are in data set 2.
SELECT * FROMTABLE 2
INNER JOINTABLE 3
ONTABLE 2
.geography code
=TABLE 3
.geography code
Here is a screenshot of the output:
R equivalent
bulk <- read.csv("bulk.csv") bulk1 <- read.csv("bulk (1).csv") bulk_merged <- merge(bulk1, bulk, by = "geography.code")
Left join
A LEFT JOIN takes all the values in the first table plus the records that match those in the second table.
In our example, if there are values in TABLE 2 not in TABLE 3, then they will still be included but there will be NULL values in the right hand side of the join, as seen in the example.
SELECT * FROMTABLE 2
LEFT JOINTABLE 3
ONTABLE 2
.geography code
=TABLE 3
.geography code
R equivalent
The R equivalent makes use of the all.x argument. Set that to true and it will include all records in the first data set, match or not.
bulk_merged <- merge(bulk, bulk1, by = "geography.code", all.x = TRUE)
Right join
The RIGHT JOIN is the same, except all the values in the second table (TABLE 3 in our example) are kept regardless of a match.
In our example, as all the values in TABLE 3 are in TABLE 2, this one gives the same effect as the INNER JOIN.
SELECT * FROMTABLE 2
RIGHT JOINTABLE 3
ONTABLE 2
.geography code
=TABLE 3
.geography code
R equivalent
The same result can be achieved by using all.y = TRUE.
bulk_merged <- merge(bulk, bulk1, by = "geography.code", all.y = TRUE)
Conclusion
For more information, W3 Schools pages on joins have useful Venn diagrams showing which records will be included in an inner, left and right join.