Cheat Sheet

Welcome to the R cheat sheet.

Here you’ll find the most useful functions and techniques I’ve found in R all in one place.

This post is a work in progress – new functions will be catalogued in here as and when.

This cheat sheet assumes you have RStudio downloaded and set up on your computer.

RStudio has the mtcars dataset pre-loaded. This will be the example dataset for much of the cheat sheet.

I want to:

  1. Make RStudio read some data from my spreadsheet
  2. See what my data looks like
  3. See a sample of my data
  4. Filter my data frame based on some data in one of the columns
  5. Merge data from several columns into one column vertically
  6. Merge two or more identical data frames underneath each other
  7. Do a find and replace
  8. Print off a CSV file from a R data frame
  9. Sort your data frame in ascending or descending order

By function:

  1. read.csv
  2. str
  3. head
  4. gather
  5. rbind
  6. gsub
  7. write.table
  8. order

Reading data from your spreadsheet

Get your data from a spreadsheet up into R

#To read data from your CSV file, use the read.csv formula
#Create a dataframe using [your_df_name] <- read.csv("your_spreadsheet.csv")

#Worked example using this spreadsheet [link] (Sheet1, downloadable as a CSV. 
#Make sure it goes in your R folder for RStudio).

data <- read.csv("Food - Sheet1.csv")

#This spreadsheet has header rows
#Sheet2 has the same data without header rows
#If you have data like this, use header = FALSE

no_names <- read.csv("Food - Sheet2.csv", header = FALSE)

no_names
 V1 V2
1 Cheese Yes
2 Bacon Yes
3 Olives No
4 Sweetcorn No
5 Asparagus Yes
6 Tomato Yes
7 Sweet potato No
8 Parsnip Yes
9 Lime Yes
10 Pear Yes

#Notice it fills in V1 and V2 as column names
#The default is stringsAsFactors = TRUE
#This means that all duplicated strings (words usually) will be grouped together
#For example, stringsAsFactors = TRUE with Sheet1 will group Yes x 7 and No x 3:

str(data)
data.frame': 10 obs. of 2 variables:
 $ Food : Factor w/ 10 levels "Asparagus","Bacon",..: 3 2 5 9 1 10 8 6 4 7
 $ Tasty.: Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 1 2 2 2

#If you wanted each to be treated separately and not grouped, 
#set stringsAsFactors = FALSE

no_factors <- read.csv("Food - Sheet1.csv", stringsAsFactors = FALSE)

str(no_factors)
data.frame': 10 obs. of 2 variables:
 $ Food : chr "Cheese" "Bacon" "Olives" "Sweetcorn" ...
 $ Tasty.: chr "Yes" "Yes" "No" "No" ...

#Generally, it's more useful to have stringsAsFactors set to TRUE, 
#which is the default

Seeing the structure of your data

The str function gives you an overview of how your data looks.

#use the str function

str(mtcars)

'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

#all the $ signs refer to a different column in the data frame.
#the '...' means there is more data than what's displayed

See the beginning, end or a random bit of your dataset.

#Use sample, head or tail
#sample takes a random sample from anywhere in your data
#head takes the beginning of the dataset
#tail takes the end

head(mtcars)

 mpg cyl disp hp drat wt qsec vs am gear
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3
 carb
Mazda RX4 4
Mazda RX4 Wag 4
Datsun 710 1
Hornet 4 Drive 1
Hornet Sportabout 2
Valiant 1

#These are all columns, even though mtcars?carb is underneath the rest

Filter your data frame

Filter your data frame based on the data in one column. Thanks to Sharon Machlis for the tips in searching for numbers and strings here.

#To search for numbers, use this:
#new_data <- existing_data[existing_data$col==[number], ]

#so to search for all six cylinder cars in mtcars:
six_cyl_data <- mtcars[mtcars$cyl==6, ]

> six_cyl_data <- mtcars[mtcars$cyl==6, ]
> six_cyl_data
 mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6

#For strings (usually words), use grep and regular expressions
An easy way to remember this is needle in a haystack,
#where the needle is what you are looking for and the haystack where you are looking
#filter <- grep("[needle]", "[haystack]")

> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

grep("setosa",iris$Species)

#filter iris just for setosa
> setosa <- grep("setosa",iris$Species)

#apply that to iris data frame
> setosa_df <- iris[setosa, ]

#our new data frame just has the setosa species
> setosa_df
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
21 5.4 3.4 1.7 0.2 setosa
22 5.1 3.7 1.5 0.4 setosa
23 4.6 3.6 1.0 0.2 setosa
24 5.1 3.3 1.7 0.5 setosa
25 4.8 3.4 1.9 0.2 setosa
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
29 5.2 3.4 1.4 0.2 setosa
30 4.7 3.2 1.6 0.2 setosa
31 4.8 3.1 1.6 0.2 setosa
32 5.4 3.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
36 5.0 3.2 1.2 0.2 setosa
37 5.5 3.5 1.3 0.2 setosa
38 4.9 3.6 1.4 0.1 setosa
39 4.4 3.0 1.3 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa

Merge data from several columns into one column vertically

We can use the gather function to take data from several horizontal columns and display it vertically, without losing the distinctions in the column names. This is great for moving data spread across several years into a single year, allowing for plotting on geom_line in ggplot2.

library(tidyr)
#create some imaginary unemployment data covering 2014 and 2015

> messy_data <- data.frame 
(country = c("UK","France","Germany"),
unemployment_2014 = c(5.5,7.9,9.3), 
unemployment_2015 = c(4.9,6.4,8.6))
> messy_data
 country unemployment_2014 unemployment_2015
1 UK 5.5 4.9
2 France 7.9 6.4
3 Germany 9.3 8.6

> new_data <- gather(messy_data, year, percent, 
unemployment_2014:unemployment_2015)
> new_data
 country year percent
1 UK unemployment_2014 5.5
2 France unemployment_2014 7.9
3 Germany unemployment_2014 9.3
4 UK unemployment_2015 4.9
5 France unemployment_2015 6.4
6 Germany unemployment_2015 8.6

#the new table contains the years in a single column, called year
#and the values for each year in a single column, called percent

#alternative syntax
> new_data <- messy_data %>% gather(year, percent, 
unemployment_2014:unemployment_2015)
> new_data
 country year percent
1 UK unemployment_2014 5.5
2 France unemployment_2014 7.9
3 Germany unemployment_2014 9.3
4 UK unemployment_2015 4.9
5 France unemployment_2015 6.4
6 Germany unemployment_2015 8.6

Merge several identical datasets together, one underneath another

Use rbind. Make sure your data frames have the same number of columns, named identically.

#create first data frame
> df1 <- data.frame(numbers = c(22,33), text = c("Some","text"))
> df1
 numbers text
1 22 Some
2 33 text

#create second data frame
> df2 <- data.frame(numbers = c(44,55,66), text = c("Some", "more","text"))
> df2
 numbers text
1 44 Some
2 55 more
3 66 text

#use rbind to combine the two
> df3 <- rbind(df1,df2)
> df3
 numbers text
1 22 Some
2 33 text
3 44 Some
4 55 more
5 66 text

Do a find and replace

Use the gsub function. The syntax is as follows:

gsub("Text to find","Text to replace with",where_to_look)
#example
> df <- data.frame(text = c("original", "text"))
> df 
 text
1 original
2 text
> 
> df$text <- gsub("original","new",df$text)
> df
 text
1 new
2 text

Print off a CSV file from a R data frame

Use the write.table function. The basic syntax is as follows:

write.table("data_frame","your_file.csv")

It will print off a CSV file into your current directory. You may have to move all the column names one column to the right in your CSV file because R automatically prints off IDs as column A.

Order your data frame

#Use the order function as follows:

df2 <- df[order(df$column_to_sort),]

Make sure you remember the comma at the end or you'll get an error
the default order is ascending. Put a minus sign in front of column_to_sort to sort descending. Remember this will not plot in order in ggplot2.

To sort your data quickly, an alternative way is to click on your data frame in the 'data' section of RStudio.

Click on the 'prisons' tab here

From there, a new tab will open to the right of your script. Click on that to get a spreadsheet-style visualisation of your data. Click on one of the tabs to sort the data in ascending order, then again to sort descending. Remember - this won't sort your data permanently.

The blue arrow means the data is sorted in descending order by that column