This is the course handbook for WolfWorks: An introduction to R.


Objectives:

  1. Appreciate the benefits of working within the tidyverse
  2. Understand the differences and similarities between data.frame and tibble structures
  3. Apply the pipe (%>%) function to simplify the consecutive use of multiple commands
  4. Use key functions to manipulate data within thetidyverse - select(), filter(), mutate(), pull(), group_by() and summarise()

Installing and loading packages with additional functionality

Everything we have done so far has utilised Base-R, that is the basic software which is automatically included when you download R. Base packages are installed and loaded by default and provide all of the basic functions required to allow R to function.

In addition to our Base-R packages, there are thousands of other R packages which we can make use of to improve our data analysis and facilitate extra functions. To use a package which is not part of Base-R, we first need to install the package using install.packages (a Base-R function). This only needs to be done once (until you update your R or RStudio).

Let’s install the tidyverse package.

# install the tidyverse package
install.packages("tidyverse")

When you install a package R will also install all other packages on which the package depends, termed package dependencies. This is because many functions rely on the internal use of other functions to run correctly.

Once a package has been installed we need to load it into the current RStudio session so that we can use it. To load a package we use the Base-R library function. We cannot use the functions within a package until it is both installed and loaded. Whilst the installation only needs to be done once (until updates), we have to load the package every time we start a new RStudio session (i.e., when we close RStudio and then re-open it). Let’s load the tidyverse package that we have installed.

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Introduction to the tidyverse

The tidyverse is an umbrella package, that is to say that it comprises of many useful packages that are all designed to complement each other and work together. The tidyverse packages include ggplot2, tidyr, dplyr and tibble, among many others. We will make use of several of these during the remainder of the workshop.

All packages that are part of the tidyverse were designed with the same philosophy, coding grammar and data structures. The primary aim, as you can probably guess from the name, is to promote tidy data and tidy data analysis.

The tibble structure vs. Base-R data.frame

In the previous lesson we used the data.frame structure within Base-R to store our data. A tibble is essentially an improved version of a data.frame which is compatible with all packages and functions in the tidyverse.

We can load our data directly into a tibble using the read_csv function (not the read.csv function which would give us a Base-R data.frame). Alternatively, if we already had a data.frame we could convert this into a tibble using as_tibble().

## Load data into R as a tibble
rna_tbl <- read_csv(file = "raw_data/rnaseq_data.csv")
## Rows: 32428 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): gene, sample, organism, sex, infection, strain, tissue, product, ensembl_gene_id, external...
## dbl  (5): expression, age, time, mouse, ENTREZID
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Look at the structure of our tibble
str(rna_tbl)
## spc_tbl_ [32,428 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gene                                 : chr [1:32428] "Asl" "Apod" "Cyp2d22" "Klk6" ...
##  $ sample                               : chr [1:32428] "GSM2545336" "GSM2545336" "GSM2545336" "GSM2545336" ...
##  $ expression                           : num [1:32428] 1170 36194 4060 287 85 ...
##  $ organism                             : chr [1:32428] "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" ...
##  $ age                                  : num [1:32428] 8 8 8 8 8 8 8 8 8 8 ...
##  $ sex                                  : chr [1:32428] "Female" "Female" "Female" "Female" ...
##  $ infection                            : chr [1:32428] "InfluenzaA" "InfluenzaA" "InfluenzaA" "InfluenzaA" ...
##  $ strain                               : chr [1:32428] "C57BL/6" "C57BL/6" "C57BL/6" "C57BL/6" ...
##  $ time                                 : num [1:32428] 8 8 8 8 8 8 8 8 8 8 ...
##  $ tissue                               : chr [1:32428] "Cerebellum" "Cerebellum" "Cerebellum" "Cerebellum" ...
##  $ mouse                                : num [1:32428] 14 14 14 14 14 14 14 14 14 14 ...
##  $ ENTREZID                             : num [1:32428] 109900 11815 56448 19144 80891 ...
##  $ product                              : chr [1:32428] "argininosuccinate lyase, transcript variant X1" "apolipoprotein D, transcript variant 3" "cytochrome P450, family 2, subfamily d, polypeptide 22, transcript variant 2" "kallikrein related-peptidase 6, transcript variant 2" ...
##  $ ensembl_gene_id                      : chr [1:32428] "ENSMUSG00000025533" "ENSMUSG00000022548" "ENSMUSG00000061740" "ENSMUSG00000050063" ...
##  $ external_synonym                     : chr [1:32428] "2510006M18Rik" NA "2D22" "Bssp" ...
##  $ chromosome_name                      : chr [1:32428] "5" "16" "15" "7" ...
##  $ gene_biotype                         : chr [1:32428] "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
##  $ phenotype_description                : chr [1:32428] "abnormal circulating amino acid level" "abnormal lipid homeostasis" "abnormal skin morphology" "abnormal cytokine level" ...
##  $ hsapiens_homolog_associated_gene_name: chr [1:32428] "ASL" "APOD" "CYP2D6" "KLK6" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gene = col_character(),
##   ..   sample = col_character(),
##   ..   expression = col_double(),
##   ..   organism = col_character(),
##   ..   age = col_double(),
##   ..   sex = col_character(),
##   ..   infection = col_character(),
##   ..   strain = col_character(),
##   ..   time = col_double(),
##   ..   tissue = col_character(),
##   ..   mouse = col_double(),
##   ..   ENTREZID = col_double(),
##   ..   product = col_character(),
##   ..   ensembl_gene_id = col_character(),
##   ..   external_synonym = col_character(),
##   ..   chromosome_name = col_character(),
##   ..   gene_biotype = col_character(),
##   ..   phenotype_description = col_character(),
##   ..   hsapiens_homolog_associated_gene_name = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Our tibble is still a data.frame but it also has some additional classes (tbl_df and tbl). These are the classes tibble inherits from, which in simple terms means that tibble is a data.frame with a few modifications. Therefore, most of the functions that we previously used with data.frame can also be used with tibble.

What is the difference between a data.frame and a tibble?

If we can do everything we previously did with a data.frame with our tibble, what is the point of a tibble?

The modifications made to a data.frame to generate a tibble have one main purpose - to create a neater format of data which can be manipulated in simpler ways using packages and functions from the tidyverse. The tibble is a tidyverse-compatible data.frame. Other than being able to use a tibble for “tidy” data processing and plotting, several other attributes of a tibble contribute to its tidyness:

  • Printing a tibble will not automatically print the entire data set, but just the first 10 rows and the number of columns that fit on the screen
  • The data type contained in each column is specified underneath the column name - this reduces the need for us to use class() or str()
  • tibbles are stricter and produce more warnings e.g., if you try to subset based on a column that does not exist

The pipe operator

One of the most influential additions that we gain by using the tidyverse is the pipe operator (%>% or |>) from the magrittr package (loaded as one of the tidyverses packages).

The function of a pipe is to pass the output of one function (on its left) directly as an input to another function (on its right). This reduces the need for difficult-to-read nested functions. For example, if I wanted to take the mean of all expression values in our data set and round this value to two significant figures:

## In base-R using nested functions
signif(mean(rna_tbl$expression, na.rm = TRUE), digits = 2)
## [1] 1900
## In the tidyverse using the pipe operator
rna_tbl %>%
  pull(expression) %>%
  mean(na.rm = TRUE) %>%
  signif(digits = 2)
## [1] 1900

It is much easier to see step-by-step what the code is doing when we use the pipe operator. We have also introduced a new function - pull(). This is a function that is commonly used in the tidyverse to extract a specific column as a vector. We will see more of this soon.


Data manipulation using dplyr and tidyr

Both dplyr and tidyr are two of the packages installed and loaded within the tidyverse umbrella package. The most common functions within the dplyr package are:

Selecting columns and filtering rows - select() and filter()

To select only the columns that we’re interested in we can make use of the select() function. This can be particularly important when working with a tibble that is very large since we have already discovered that printing a tibble will only display the number of columns that fit on our screen - we want to make sure that these are the columns we are interested in.

Using the $ operator is quite convenient for selecting a single column and extracting the values as a vector. Selecting several columns using the [] subsetting operator is a bit more cumbersome.

## Subsetting columns based on []
rna_tbl[, c("gene", "expression", "age", "sex")]
## # A tibble: 32,428 × 4
##    gene    expression   age sex   
##    <chr>        <dbl> <dbl> <chr> 
##  1 Asl           1170     8 Female
##  2 Apod         36194     8 Female
##  3 Cyp2d22       4060     8 Female
##  4 Klk6           287     8 Female
##  5 Fcrls           85     8 Female
##  6 Slc2a4         782     8 Female
##  7 Exd2          1619     8 Female
##  8 Gjc2           288     8 Female
##  9 Plp1         43217     8 Female
## 10 Gnb4          1071     8 Female
## # ℹ 32,418 more rows
## Subsetting columns using select()
rna_tbl %>%
  select(gene, expression, age, sex)
## # A tibble: 32,428 × 4
##    gene    expression   age sex   
##    <chr>        <dbl> <dbl> <chr> 
##  1 Asl           1170     8 Female
##  2 Apod         36194     8 Female
##  3 Cyp2d22       4060     8 Female
##  4 Klk6           287     8 Female
##  5 Fcrls           85     8 Female
##  6 Slc2a4         782     8 Female
##  7 Exd2          1619     8 Female
##  8 Gjc2           288     8 Female
##  9 Plp1         43217     8 Female
## 10 Gnb4          1071     8 Female
## # ℹ 32,418 more rows

We can also use the “-” sign to select all columns except for those that we are not interested in.

rna_tbl %>%
  select(-gene, -expression)


Question: Take a look at the above code. What else differs when referring to columns in the tidyverse?

Another thing that we previously used the [] for was conditional subsetting of rows. Again, the dplyr package contains a convenient and tidy function for doing this - filter().

## Conditionally subsetting rows based on []
rna_tbl[rna_tbl$gene == "Exd2", ]

## Conditionally subsetting rows using filter()
rna_tbl %>%
  filter(gene == "Exd2")

The most convenient thing about using the select() and filter() functions is that their output (the subsetted data) can be piped straight into another function. Using the base-R [] subsetting we would either have to assign the subsetted data to a new object and then start a new command with this object, or create a difficult-to-read nested command.

Question: If I wanted to use both select() and filter() to get a subset of rows and columns, which order would be the best to do this in?

What is the difference between pull() and select()

The pull() function returns a single column as a vector; the select() function returns one or more columns as a data frame. This means that we can pipe the output of pull() straight into a function expecting numbers.

## Using pull to return a numeric vector
rna_tbl %>%
  pull(expression) %>%
  median(na.rm = TRUE)
## [1] 547
## Using select does not return a numeric vector 
rna_tbl %>%
  select(expression) %>%
  median(na.rm = TRUE)



Creating new columns with mutate()

It is very common to want to modify a column or create a new column based on existing columns. For example, we might want to convert units (multiply all values in a column by a conversion factor) or find the ratio between two columns. For this we can use the mutate() function.

Let’s create a new column in our data which stores time in hours rather than days.

## Use mutate() to create new columns
rna_tbl %>%
  mutate(time_hrs = time * 24) %>%
  select(time, time_hrs)
## # A tibble: 32,428 × 2
##     time time_hrs
##    <dbl>    <dbl>
##  1     8      192
##  2     8      192
##  3     8      192
##  4     8      192
##  5     8      192
##  6     8      192
##  7     8      192
##  8     8      192
##  9     8      192
## 10     8      192
## # ℹ 32,418 more rows

When the mutate() function is executed, each new column is generated one-by-one. This means that we can actually create a second column based on our new column, all within a single command.

## Use mutate to create new columns
rna_tbl %>%
  mutate(time_hrs = time * 24, 
         time_mins = time_hrs * 60) %>%
  select(time, time_hrs, time_mins)
## # A tibble: 32,428 × 3
##     time time_hrs time_mins
##    <dbl>    <dbl>     <dbl>
##  1     8      192     11520
##  2     8      192     11520
##  3     8      192     11520
##  4     8      192     11520
##  5     8      192     11520
##  6     8      192     11520
##  7     8      192     11520
##  8     8      192     11520
##  9     8      192     11520
## 10     8      192     11520
## # ℹ 32,418 more rows



Split-apply-combine data analysis using group_by() and summarise()

Most data contains information about groups, and these are often defined by factors. For example, in our data we have males and females, time points 0, 4 and 8 days and all of our genes. In some cases we may want to do analysis on a per-group basis. Hence, we split the data into groups, apply some analysis to each group, and then combine the results. This process is made easy by the group_by() function.

rna_tbl %>%
  group_by(sex) 
## # A tibble: 32,428 × 19
## # Groups:   sex [2]
##    gene    sample   expression organism   age sex   infection strain  time tissue mouse ENTREZID product
##    <chr>   <chr>         <dbl> <chr>    <dbl> <chr> <chr>     <chr>  <dbl> <chr>  <dbl>    <dbl> <chr>  
##  1 Asl     GSM2545…       1170 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14   109900 argini…
##  2 Apod    GSM2545…      36194 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    11815 apolip…
##  3 Cyp2d22 GSM2545…       4060 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    56448 cytoch…
##  4 Klk6    GSM2545…        287 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    19144 kallik…
##  5 Fcrls   GSM2545…         85 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    80891 Fc rec…
##  6 Slc2a4  GSM2545…        782 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    20528 solute…
##  7 Exd2    GSM2545…       1619 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    97827 exonuc…
##  8 Gjc2    GSM2545…        288 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14   118454 gap ju…
##  9 Plp1    GSM2545…      43217 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    18823 proteo…
## 10 Gnb4    GSM2545…       1071 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    14696 guanin…
## # ℹ 32,418 more rows
## # ℹ 6 more variables: ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
## #   gene_biotype <chr>, phenotype_description <chr>, hsapiens_homolog_associated_gene_name <chr>

We can see at the top of our tibble that we have two groups based on the factor sex. We could similarly group by time point.

rna_tbl %>%
  group_by(time)
## # A tibble: 32,428 × 19
## # Groups:   time [3]
##    gene    sample   expression organism   age sex   infection strain  time tissue mouse ENTREZID product
##    <chr>   <chr>         <dbl> <chr>    <dbl> <chr> <chr>     <chr>  <dbl> <chr>  <dbl>    <dbl> <chr>  
##  1 Asl     GSM2545…       1170 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14   109900 argini…
##  2 Apod    GSM2545…      36194 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    11815 apolip…
##  3 Cyp2d22 GSM2545…       4060 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    56448 cytoch…
##  4 Klk6    GSM2545…        287 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    19144 kallik…
##  5 Fcrls   GSM2545…         85 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    80891 Fc rec…
##  6 Slc2a4  GSM2545…        782 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    20528 solute…
##  7 Exd2    GSM2545…       1619 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    97827 exonuc…
##  8 Gjc2    GSM2545…        288 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14   118454 gap ju…
##  9 Plp1    GSM2545…      43217 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    18823 proteo…
## 10 Gnb4    GSM2545…       1071 Mus mus…     8 Fema… Influenz… C57BL…     8 Cereb…    14    14696 guanin…
## # ℹ 32,418 more rows
## # ℹ 6 more variables: ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
## #   gene_biotype <chr>, phenotype_description <chr>, hsapiens_homolog_associated_gene_name <chr>

Now we have three groups based on the time column.

Once we have grouped our data, subsequent operations will be carried out on a per-group basis. This is facilitated by the summarise() function. Let’s try to find the mean expression across all genes at each time point.

rna_tbl %>%
  group_by(time) %>%
  summarise(mean_expression = mean(expression, na.rm = TRUE))
## # A tibble: 3 × 2
##    time mean_expression
##   <dbl>           <dbl>
## 1     0           1888.
## 2     4           1872.
## 3     8           1820.

We can also group by more than one factor at once. For example, if we wanted the mean expression value for males and females at each timepoint.

rna_tbl %>% 
  group_by(sex, time) %>%
  summarise(mean_expression = mean(expression, na.rm = TRUE))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 6 × 3
## # Groups:   sex [2]
##   sex     time mean_expression
##   <chr>  <dbl>           <dbl>
## 1 Female     0           1853.
## 2 Female     4           1892.
## 3 Female     8           1901.
## 4 Male       0           1934.
## 5 Male       4           1852.
## 6 Male       8           1712.

Similarly, once the data is grouped, we can also summarise several variables at the same time.

rna_tbl %>%
  group_by(sex, time) %>%
  summarise(mean_expression = mean(expression, na.rm = TRUE),
            med_expression = median(expression, na.rm = TRUE))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   sex [2]
##   sex     time mean_expression med_expression
##   <chr>  <dbl>           <dbl>          <dbl>
## 1 Female     0           1853.           528.
## 2 Female     4           1892.           568 
## 3 Female     8           1901.           557 
## 4 Male       0           1934.           573 
## 5 Male       4           1852.           535 
## 6 Male       8           1712.           496.



Challenge: Using the tidyverse for data manipulation and analysis

Use the dplyr() functions that we have discussed to answer the following questions:

  1. How many protein_coding genes were measured in this study?
  2. What is the mean expression of ‘lncRNA’ and ‘miRNA’ genes for males and females at 8-days post-treatment?
  3. Identify genes associated with the “abnormal DNA methylation” phenotype, and calculate their mean expression (in log) at time 0-, 4- and 8-days post-treatment.
Solution


rna_tbl %>%
  group_by(gene_biotype) %>%
  summarise(number = length(time))
## # A tibble: 13 × 2
##    gene_biotype                       number
##    <chr>                               <int>
##  1 IG_C_gene                              22
##  2 TEC                                    88
##  3 lncRNA                               1518
##  4 miRNA                                 154
##  5 polymorphic_pseudogene                 44
##  6 processed_pseudogene                 1298
##  7 protein_coding                      29062
##  8 scaRNA                                 22
##  9 snoRNA                                110
## 10 transcribed_processed_pseudogene       22
## 11 transcribed_unitary_pseudogene         22
## 12 transcribed_unprocessed_pseudogene     22
## 13 unprocessed_pseudogene                 44
## Can be done easily with count()
rna_tbl %>%
  count(gene_biotype)
## # A tibble: 13 × 2
##    gene_biotype                           n
##    <chr>                              <int>
##  1 IG_C_gene                             22
##  2 TEC                                   88
##  3 lncRNA                              1518
##  4 miRNA                                154
##  5 polymorphic_pseudogene                44
##  6 processed_pseudogene                1298
##  7 protein_coding                     29062
##  8 scaRNA                                22
##  9 snoRNA                               110
## 10 transcribed_processed_pseudogene      22
## 11 transcribed_unitary_pseudogene        22
## 12 transcribed_unprocessed_pseudogene    22
## 13 unprocessed_pseudogene                44
rna_tbl %>%
  filter(gene_biotype == "lncRNA" | gene_biotype == "miRNA") %>%
  group_by(sex, gene_biotype) %>%
  summarise(mean_exp = mean(expression, na.rm = TRUE))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   sex [2]
##   sex    gene_biotype mean_exp
##   <chr>  <chr>           <dbl>
## 1 Female lncRNA         800.  
## 2 Female miRNA            9.02
## 3 Male   lncRNA         190.  
## 4 Male   miRNA           11.4
rna_tbl %>%
  filter(phenotype_description == "abnormal DNA methylation") %>%
  group_by(gene, time) %>%
  summarise(mean_exp = mean(expression, na.rm = TRUE)) %>%
  mutate(log_mean_exp = log(mean_exp))
## `summarise()` has grouped output by 'gene'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   gene [2]
##   gene   time mean_exp log_mean_exp
##   <chr> <dbl>    <dbl>        <dbl>
## 1 Xist      0   24164         10.1 
## 2 Xist      4   20184.         9.91
## 3 Xist      8   24609.        10.1 
## 4 Zdbf2     0     534.         6.28
## 5 Zdbf2     4     570.         6.35
## 6 Zdbf2     8     496.         6.21