Cheat Sheet
Welcome to the R cheat sheet.
Here you’ll find the most useful functions and techniques I’ve found in R all in one place.
This post is a work in progress – new functions will be catalogued in here as and when.
This cheat sheet assumes you have RStudio downloaded and set up on your computer.
RStudio has the mtcars dataset pre-loaded. This will be the example dataset for much of the cheat sheet.
Remember that there are often multiple ways with R to arrive at the same result!
I want to:
- Make RStudio read some data from my spreadsheet
- See what my data looks like
- See a sample of my data
- Filter my data frame based on some data in one of the columns
- Merge data from several columns into one column vertically
- Merge two or more identical data frames underneath each other
- Do a find and replace
- Print off a CSV file from a R data frame
- Sort your data frame in ascending or descending order
- Remove any duplicate values
- Remove an object from my environment
- Test to see whether something is true or false
- Change some data to numeric format
- Format dates in R
By function:
Reading data from a spreadsheet up into R
To read data from your CSV file, use the read.csv formula
#Create a dataframe using [your_df_name] <- read.csv("your_spreadsheet.csv") #Worked example using this spreadsheet [link] (Sheet1, downloadable as a CSV. #Make sure it goes in your R folder for RStudio). data <- read.csv("Food - Sheet1.csv") #This spreadsheet has header rows #Sheet2 has the same data without header rows #If you have data like this, use header = FALSE no_names <- read.csv("Food - Sheet2.csv", header = FALSE) no_names V1 V2 1 Cheese Yes 2 Bacon Yes 3 Olives No 4 Sweetcorn No 5 Asparagus Yes 6 Tomato Yes 7 Sweet potato No 8 Parsnip Yes 9 Lime Yes 10 Pear Yes #Notice it fills in V1 and V2 as column names #The default is stringsAsFactors = TRUE #This means that all duplicated strings (words usually) will be grouped together #For example, stringsAsFactors = TRUE with Sheet1 will group Yes x 7 and No x 3: str(data) data.frame': 10 obs. of 2 variables: $ Food : Factor w/ 10 levels "Asparagus","Bacon",..: 3 2 5 9 1 10 8 6 4 7 $ Tasty.: Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 1 2 2 2 #If you wanted each to be treated separately and not grouped, #set stringsAsFactors = FALSE no_factors <- read.csv("Food - Sheet1.csv", stringsAsFactors = FALSE) str(no_factors) data.frame': 10 obs. of 2 variables: $ Food : chr "Cheese" "Bacon" "Olives" "Sweetcorn" ... $ Tasty.: chr "Yes" "Yes" "No" "No" ... #Generally, it's more useful to have stringsAsFactors set to TRUE, #which is the default
Seeing the structure of your data
The str function gives you an overview of how your data looks.
#use the str function str(mtcars) 'data.frame': 32 obs. of 11 variables: $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... $ disp: num 160 160 108 258 360 ... $ hp : num 110 110 93 110 175 105 245 62 95 123 ... $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... $ wt : num 2.62 2.88 2.32 3.21 3.44 ... $ qsec: num 16.5 17 18.6 19.4 17 ... $ vs : num 0 0 1 1 0 1 0 1 1 1 ... $ am : num 1 1 1 0 0 0 0 0 0 0 ... $ gear: num 4 4 4 3 3 3 3 4 4 4 ... $ carb: num 4 4 1 1 2 1 4 2 2 4 ... #all the $ signs refer to a different column in the data frame. #the '...' means there is more data than what's displayed
Seeing a sample of your data
See the beginning, end or a random bit of your dataset.
#Use sample, head or tail #sample takes a random sample from anywhere in your data #head takes the beginning of the dataset #tail takes the end head(mtcars) mpg cyl disp hp drat wt qsec vs am gear Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 carb Mazda RX4 4 Mazda RX4 Wag 4 Datsun 710 1 Hornet 4 Drive 1 Hornet Sportabout 2 Valiant 1 #These are all columns, even though mtcars?carb is underneath the rest
Filter your data frame
Filter your data frame based on the data in one column. Thanks to Sharon Machlis for the tips in searching for numbers and strings here.
#To search for numbers, use this: #new_data <- existing_data[existing_data$col==[number], ] #so to search for all six cylinder cars in mtcars: six_cyl_data <- mtcars[mtcars$cyl==6, ] > six_cyl_data <- mtcars[mtcars$cyl==6, ] > six_cyl_data mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 #For strings (usually words), use grep and regular expressions An easy way to remember this is needle in a haystack, #where the needle is what you are looking for and the haystack where you are looking #filter <- grep("[needle]", "[haystack]") > str(iris) 'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... grep("setosa",iris$Species) #filter iris just for setosa > setosa <- grep("setosa",iris$Species) #apply that to iris data frame > setosa_df <- iris[setosa, ] #our new data frame just has the setosa species > setosa_df Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa 7 4.6 3.4 1.4 0.3 setosa 8 5.0 3.4 1.5 0.2 setosa 9 4.4 2.9 1.4 0.2 setosa 10 4.9 3.1 1.5 0.1 setosa 11 5.4 3.7 1.5 0.2 setosa 12 4.8 3.4 1.6 0.2 setosa 13 4.8 3.0 1.4 0.1 setosa 14 4.3 3.0 1.1 0.1 setosa 15 5.8 4.0 1.2 0.2 setosa 16 5.7 4.4 1.5 0.4 setosa 17 5.4 3.9 1.3 0.4 setosa 18 5.1 3.5 1.4 0.3 setosa 19 5.7 3.8 1.7 0.3 setosa 20 5.1 3.8 1.5 0.3 setosa 21 5.4 3.4 1.7 0.2 setosa 22 5.1 3.7 1.5 0.4 setosa 23 4.6 3.6 1.0 0.2 setosa 24 5.1 3.3 1.7 0.5 setosa 25 4.8 3.4 1.9 0.2 setosa 26 5.0 3.0 1.6 0.2 setosa 27 5.0 3.4 1.6 0.4 setosa 28 5.2 3.5 1.5 0.2 setosa 29 5.2 3.4 1.4 0.2 setosa 30 4.7 3.2 1.6 0.2 setosa 31 4.8 3.1 1.6 0.2 setosa 32 5.4 3.4 1.5 0.4 setosa 33 5.2 4.1 1.5 0.1 setosa 34 5.5 4.2 1.4 0.2 setosa 35 4.9 3.1 1.5 0.2 setosa 36 5.0 3.2 1.2 0.2 setosa 37 5.5 3.5 1.3 0.2 setosa 38 4.9 3.6 1.4 0.1 setosa 39 4.4 3.0 1.3 0.2 setosa 40 5.1 3.4 1.5 0.2 setosa 41 5.0 3.5 1.3 0.3 setosa 42 4.5 2.3 1.3 0.3 setosa 43 4.4 3.2 1.3 0.2 setosa 44 5.0 3.5 1.6 0.6 setosa 45 5.1 3.8 1.9 0.4 setosa 46 4.8 3.0 1.4 0.3 setosa 47 5.1 3.8 1.6 0.2 setosa 48 4.6 3.2 1.4 0.2 setosa 49 5.3 3.7 1.5 0.2 setosa 50 5.0 3.3 1.4 0.2 setosa
Merge data from several columns into one column vertically
We can use the gather() function to take data from several horizontal columns and display it vertically, without losing the distinctions in the column names. This is great for moving data spread across several years into a single year, allowing for plotting on geom_line() in ggplot2.
library(tidyr) #create some imaginary unemployment data covering 2014 and 2015 > messy_data <- data.frame (country = c("UK","France","Germany"), unemployment_2014 = c(5.5,7.9,9.3), unemployment_2015 = c(4.9,6.4,8.6)) > messy_data country unemployment_2014 unemployment_2015 1 UK 5.5 4.9 2 France 7.9 6.4 3 Germany 9.3 8.6 > new_data <- gather(messy_data, year, percent, unemployment_2014:unemployment_2015) > new_data country year percent 1 UK unemployment_2014 5.5 2 France unemployment_2014 7.9 3 Germany unemployment_2014 9.3 4 UK unemployment_2015 4.9 5 France unemployment_2015 6.4 6 Germany unemployment_2015 8.6 #the new table contains the years in a single column, called year #and the values for each year in a single column, called percent #alternative syntax > new_data <- messy_data %>% gather(year, percent, unemployment_2014:unemployment_2015) > new_data country year percent 1 UK unemployment_2014 5.5 2 France unemployment_2014 7.9 3 Germany unemployment_2014 9.3 4 UK unemployment_2015 4.9 5 France unemployment_2015 6.4 6 Germany unemployment_2015 8.6
Merge several identical datasets together, one underneath another
Use rbind(). Make sure your data frames have the same number of columns, named identically.
#create first data frame > df1 <- data.frame(numbers = c(22,33), text = c("Some","text")) > df1 numbers text 1 22 Some 2 33 text #create second data frame > df2 <- data.frame(numbers = c(44,55,66), text = c("Some", "more","text")) > df2 numbers text 1 44 Some 2 55 more 3 66 text #use rbind to combine the two > df3 <- rbind(df1,df2) > df3 numbers text 1 22 Some 2 33 text 3 44 Some 4 55 more 5 66 text
Do a find and replace
#Use the gsub() function. The syntax is as follows: gsub("Text to find","Text to replace with",where_to_look) #example > df <- data.frame(text = c("original", "text")) > df text 1 original 2 text > > df$text <- gsub("original","new",df$text) > df text 1 new 2 text
Print off a CSV file from a R data frame
Use the write.csv function. The basic syntax is as follows: write.csv(data_frame,"your_file.csv")
It will print off a CSV file into your current directory.
Order your data frame
#Use the order() function as follows: df2 <- df[order(df$column_to_sort),]
Make sure you remember the comma at the end or you’ll get an error
the default order is ascending. Put a minus sign in front of column_to_sort to sort descending. Remember this will not plot in order in ggplot2.
To sort your data quickly, an alternative way is to click on your data frame in the ‘data’ section of RStudio.
From there, a new tab will open to the right of your script. Click on that to get a spreadsheet-style visualisation of your data. Click on one of the tabs to sort the data in ascending order, then again to sort descending. Remember – this won’t sort your data permanently.
Remove any duplicate values from your data frame
#use the unique() function as follows > df <- data.frame(V1 = c("no", "no", "duplicates","will","will","be","left","left")) > df2 <- unique(df) > df2 V1 1 no 3 duplicates 4 will 6 be 7 left
Remove an object from your environment
#use rm() as follows
to_delete <- data.frame(NA)
rm(to_delete)
#data frame is deleted from the environment
You can also use the grid option in RStudio, in the top-right corner, to select multiple objects for deletion easily
Test to see whether a statement is TRUE or FALSE
#use ifelse() as follows: ifelse(statement, iftrue, iffalse) df <- data.frame(x = c(4,8)) #check to see whether figures are greater than five ifelse(df$x > 5, TRUE, FALSE) > ifelse(df$x > 5, TRUE, FALSE) [1] FALSE TRUE
Format data as numeric values
> vector <- c("1","2","3") #this won't work because it's a character vector > sum(vector) Error in sum(vector) : invalid 'type' (character) of argument > str(vector) chr [1:3] "1" "2" "3" #change to numeric and it works > vector <- as.numeric(vector) > sum(vector) [1] 6
Change some data into date format
Use the as.Date() function to turn a character date format into a R date format.
The format argument of as.Date() should detail the format of your date, including any spaces, slashes etc. For details on all possible date formats, go here.
myDate <- "1 January 2018" > str(myDate) chr "1 January 2018" > #change to date format > myDate <- as.Date(myDate, format = "%e %B %Y") > str(myDate) Date[1:1], format: "2018-01-01" > #notice the reformatting of the date > myDate [1] "2018-01-01"