This is the course handbook for WolfWorks: An introduction to R.
Objectives:
tidyverse
data.frame
and tibble
structures%>%
) function to simplify the consecutive use of multiple commandstidyverse
- select()
, filter()
, mutate()
, pull()
, group_by()
and summarise()
Everything we have done so far has utilised Base-R, that is the basic software which is automatically included when you download R. Base packages are installed and loaded by default and provide all of the basic functions required to allow R to function.
In addition to our Base-R packages, there are thousands of other R packages which we can make use of to improve our data analysis and facilitate extra functions. To use a package which is not part of Base-R, we first need to install the package using install.packages
(a Base-R function). This only needs to be done once (until you update your R or RStudio).
Let’s install the tidyverse
package.
# install the tidyverse package
install.packages("tidyverse")
When you install a package R will also install all other packages on which the package depends, termed package dependencies
. This is because many functions rely on the internal use of other functions to run correctly.
Once a package has been installed we need to load it into the current RStudio session so that we can use it. To load a package we use the Base-R library
function. We cannot use the functions within a package until it is both installed and loaded. Whilst the installation only needs to be done once (until updates), we have to load the package every time we start a new RStudio session (i.e., when we close RStudio and then re-open it). Let’s load the tidyverse
package that we have installed.
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The tidyverse
is an umbrella package, that is to say that it comprises of many useful packages that are all designed to complement each other and work together. The tidyverse
packages include ggplot2
, tidyr
, dplyr
and tibble
, among many others. We will make use of several of these during the remainder of the workshop.
All packages that are part of the tidyverse were designed with the same philosophy, coding grammar and data structures. The primary aim, as you can probably guess from the name, is to promote tidy data and tidy data analysis.
tibble
structure vs. Base-R data.frame
In the previous lesson we used the data.frame
structure within Base-R to store our data. A tibble
is essentially an improved version of a data.frame
which is compatible with all packages and functions in the tidyverse
.
We can load our data directly into a tibble
using the read_csv
function (not the read.csv
function which would give us a Base-R data.frame
). Alternatively, if we already had a data.frame
we could convert this into a tibble
using as_tibble()
.
## Load data into R as a tibble
<- read_csv(file = "raw_data/rnaseq_data.csv") rna_tbl
## Rows: 32428 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): gene, sample, organism, sex, infection, strain, tissue, product, ensembl_gene_id, external...
## dbl (5): expression, age, time, mouse, ENTREZID
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Look at the structure of our tibble
str(rna_tbl)
## spc_tbl_ [32,428 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gene : chr [1:32428] "Asl" "Apod" "Cyp2d22" "Klk6" ...
## $ sample : chr [1:32428] "GSM2545336" "GSM2545336" "GSM2545336" "GSM2545336" ...
## $ expression : num [1:32428] 1170 36194 4060 287 85 ...
## $ organism : chr [1:32428] "Mus musculus" "Mus musculus" "Mus musculus" "Mus musculus" ...
## $ age : num [1:32428] 8 8 8 8 8 8 8 8 8 8 ...
## $ sex : chr [1:32428] "Female" "Female" "Female" "Female" ...
## $ infection : chr [1:32428] "InfluenzaA" "InfluenzaA" "InfluenzaA" "InfluenzaA" ...
## $ strain : chr [1:32428] "C57BL/6" "C57BL/6" "C57BL/6" "C57BL/6" ...
## $ time : num [1:32428] 8 8 8 8 8 8 8 8 8 8 ...
## $ tissue : chr [1:32428] "Cerebellum" "Cerebellum" "Cerebellum" "Cerebellum" ...
## $ mouse : num [1:32428] 14 14 14 14 14 14 14 14 14 14 ...
## $ ENTREZID : num [1:32428] 109900 11815 56448 19144 80891 ...
## $ product : chr [1:32428] "argininosuccinate lyase, transcript variant X1" "apolipoprotein D, transcript variant 3" "cytochrome P450, family 2, subfamily d, polypeptide 22, transcript variant 2" "kallikrein related-peptidase 6, transcript variant 2" ...
## $ ensembl_gene_id : chr [1:32428] "ENSMUSG00000025533" "ENSMUSG00000022548" "ENSMUSG00000061740" "ENSMUSG00000050063" ...
## $ external_synonym : chr [1:32428] "2510006M18Rik" NA "2D22" "Bssp" ...
## $ chromosome_name : chr [1:32428] "5" "16" "15" "7" ...
## $ gene_biotype : chr [1:32428] "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
## $ phenotype_description : chr [1:32428] "abnormal circulating amino acid level" "abnormal lipid homeostasis" "abnormal skin morphology" "abnormal cytokine level" ...
## $ hsapiens_homolog_associated_gene_name: chr [1:32428] "ASL" "APOD" "CYP2D6" "KLK6" ...
## - attr(*, "spec")=
## .. cols(
## .. gene = col_character(),
## .. sample = col_character(),
## .. expression = col_double(),
## .. organism = col_character(),
## .. age = col_double(),
## .. sex = col_character(),
## .. infection = col_character(),
## .. strain = col_character(),
## .. time = col_double(),
## .. tissue = col_character(),
## .. mouse = col_double(),
## .. ENTREZID = col_double(),
## .. product = col_character(),
## .. ensembl_gene_id = col_character(),
## .. external_synonym = col_character(),
## .. chromosome_name = col_character(),
## .. gene_biotype = col_character(),
## .. phenotype_description = col_character(),
## .. hsapiens_homolog_associated_gene_name = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Our tibble is still a data.frame
but it also has some additional classes (tbl_df
and tbl
). These are the classes tibble
inherits from, which in simple terms means that tibble
is a data.frame
with a few modifications. Therefore, most of the functions that we previously used with data.frame
can also be used with tibble
.
What is the difference between a data.frame
and a tibble
?
If we can do everything we previously did with a data.frame
with our tibble
, what is the point of a tibble
?
The modifications made to a data.frame
to generate a tibble
have one main purpose - to create a neater format of data which can be manipulated in simpler ways using packages and functions from the tidyverse
. The tibble
is a tidyverse
-compatible data.frame
. Other than being able to use a tibble
for “tidy” data processing and plotting, several other attributes of a tibble
contribute to its tidyness:
tibble
will not automatically print the entire data set, but just the first 10 rows and the number of columns that fit on the screenclass()
or str()
tibbles
are stricter and produce more warnings e.g., if you try to subset based on a column that does not exist One of the most influential additions that we gain by using the tidyverse
is the pipe operator (%>%
or |>
) from the magrittr
package (loaded as one of the tidyverse
s packages).
The function of a pipe is to pass the output of one function (on its left) directly as an input to another function (on its right). This reduces the need for difficult-to-read nested functions. For example, if I wanted to take the mean of all expression values in our data set and round this value to two significant figures:
## In base-R using nested functions
signif(mean(rna_tbl$expression, na.rm = TRUE), digits = 2)
## [1] 1900
## In the tidyverse using the pipe operator
%>%
rna_tbl pull(expression) %>%
mean(na.rm = TRUE) %>%
signif(digits = 2)
## [1] 1900
It is much easier to see step-by-step what the code is doing when we use the pipe operator. We have also introduced a new function - pull()
. This is a function that is commonly used in the tidyverse
to extract a specific column as a vector. We will see more of this soon.
dplyr
and tidyr
Both dplyr
and tidyr
are two of the packages installed and loaded within the tidyverse
umbrella package. The most common functions within the dplyr
package are:
select()
= select a subset of columns - output is a tibble
filter()
= filter for a subset of rows (based on their values in a column) - output is a tibble
mutate()
= create new columns that are functions of existing columns - output is a tibble
pull()
= extract values from a column as a vector
group_by()
= groups data based on a key (e.g., a factor)summarise()
= creates summary statistics on grouped data select()
and filter()
To select only the columns that we’re interested in we can make use of the select()
function. This can be particularly important when working with a tibble
that is very large since we have already discovered that printing a tibble
will only display the number of columns that fit on our screen - we want to make sure that these are the columns we are interested in.
Using the $
operator is quite convenient for selecting a single column and extracting the values as a vector. Selecting several columns using the []
subsetting operator is a bit more cumbersome.
## Subsetting columns based on []
c("gene", "expression", "age", "sex")] rna_tbl[,
## # A tibble: 32,428 × 4
## gene expression age sex
## <chr> <dbl> <dbl> <chr>
## 1 Asl 1170 8 Female
## 2 Apod 36194 8 Female
## 3 Cyp2d22 4060 8 Female
## 4 Klk6 287 8 Female
## 5 Fcrls 85 8 Female
## 6 Slc2a4 782 8 Female
## 7 Exd2 1619 8 Female
## 8 Gjc2 288 8 Female
## 9 Plp1 43217 8 Female
## 10 Gnb4 1071 8 Female
## # ℹ 32,418 more rows
## Subsetting columns using select()
%>%
rna_tbl select(gene, expression, age, sex)
## # A tibble: 32,428 × 4
## gene expression age sex
## <chr> <dbl> <dbl> <chr>
## 1 Asl 1170 8 Female
## 2 Apod 36194 8 Female
## 3 Cyp2d22 4060 8 Female
## 4 Klk6 287 8 Female
## 5 Fcrls 85 8 Female
## 6 Slc2a4 782 8 Female
## 7 Exd2 1619 8 Female
## 8 Gjc2 288 8 Female
## 9 Plp1 43217 8 Female
## 10 Gnb4 1071 8 Female
## # ℹ 32,418 more rows
We can also use the “-” sign to select all columns except for those that we are not interested in.
%>%
rna_tbl select(-gene, -expression)
Question: Take a look at the above code. What else differs when referring to columns in the tidyverse
?
Another thing that we previously used the []
for was conditional subsetting of rows. Again, the dplyr
package contains a convenient and tidy function for doing this - filter()
.
## Conditionally subsetting rows based on []
$gene == "Exd2", ]
rna_tbl[rna_tbl
## Conditionally subsetting rows using filter()
%>%
rna_tbl filter(gene == "Exd2")
The most convenient thing about using the select()
and filter()
functions is that their output (the subsetted data) can be piped straight into another function. Using the base-R []
subsetting we would either have to assign the subsetted data to a new object and then start a new command with this object, or create a difficult-to-read nested command.
Question: If I wanted to use both select()
and filter()
to get a subset of rows and columns, which order would be the best to do this in?
What is the difference between pull()
and select()
The pull()
function returns a single column as a vector; the select()
function returns one or more columns as a data frame. This means that we can pipe the output of pull()
straight into a function expecting numbers.
## Using pull to return a numeric vector
%>%
rna_tbl pull(expression) %>%
median(na.rm = TRUE)
## [1] 547
## Using select does not return a numeric vector
%>%
rna_tbl select(expression) %>%
median(na.rm = TRUE)
mutate()
It is very common to want to modify a column or create a new column based on existing columns. For example, we might want to convert units (multiply all values in a column by a conversion factor) or find the ratio between two columns. For this we can use the mutate()
function.
Let’s create a new column in our data which stores time in hours rather than days.
## Use mutate() to create new columns
%>%
rna_tbl mutate(time_hrs = time * 24) %>%
select(time, time_hrs)
## # A tibble: 32,428 × 2
## time time_hrs
## <dbl> <dbl>
## 1 8 192
## 2 8 192
## 3 8 192
## 4 8 192
## 5 8 192
## 6 8 192
## 7 8 192
## 8 8 192
## 9 8 192
## 10 8 192
## # ℹ 32,418 more rows
When the mutate()
function is executed, each new column is generated one-by-one. This means that we can actually create a second column based on our new column, all within a single command.
## Use mutate to create new columns
%>%
rna_tbl mutate(time_hrs = time * 24,
time_mins = time_hrs * 60) %>%
select(time, time_hrs, time_mins)
## # A tibble: 32,428 × 3
## time time_hrs time_mins
## <dbl> <dbl> <dbl>
## 1 8 192 11520
## 2 8 192 11520
## 3 8 192 11520
## 4 8 192 11520
## 5 8 192 11520
## 6 8 192 11520
## 7 8 192 11520
## 8 8 192 11520
## 9 8 192 11520
## 10 8 192 11520
## # ℹ 32,418 more rows
group_by()
and summarise()
Most data contains information about groups, and these are often defined by factors
. For example, in our data we have males and females, time points 0, 4 and 8 days and all of our genes. In some cases we may want to do analysis on a per-group basis. Hence, we split the data into groups, apply some analysis to each group, and then combine the results. This process is made easy by the group_by()
function.
%>%
rna_tbl group_by(sex)
## # A tibble: 32,428 × 19
## # Groups: sex [2]
## gene sample expression organism age sex infection strain time tissue mouse ENTREZID product
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 Asl GSM2545… 1170 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 109900 argini…
## 2 Apod GSM2545… 36194 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 11815 apolip…
## 3 Cyp2d22 GSM2545… 4060 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 56448 cytoch…
## 4 Klk6 GSM2545… 287 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 19144 kallik…
## 5 Fcrls GSM2545… 85 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 80891 Fc rec…
## 6 Slc2a4 GSM2545… 782 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 20528 solute…
## 7 Exd2 GSM2545… 1619 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 97827 exonuc…
## 8 Gjc2 GSM2545… 288 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 118454 gap ju…
## 9 Plp1 GSM2545… 43217 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 18823 proteo…
## 10 Gnb4 GSM2545… 1071 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 14696 guanin…
## # ℹ 32,418 more rows
## # ℹ 6 more variables: ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
## # gene_biotype <chr>, phenotype_description <chr>, hsapiens_homolog_associated_gene_name <chr>
We can see at the top of our tibble that we have two groups based on the factor sex
. We could similarly group by time point.
%>%
rna_tbl group_by(time)
## # A tibble: 32,428 × 19
## # Groups: time [3]
## gene sample expression organism age sex infection strain time tissue mouse ENTREZID product
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 Asl GSM2545… 1170 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 109900 argini…
## 2 Apod GSM2545… 36194 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 11815 apolip…
## 3 Cyp2d22 GSM2545… 4060 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 56448 cytoch…
## 4 Klk6 GSM2545… 287 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 19144 kallik…
## 5 Fcrls GSM2545… 85 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 80891 Fc rec…
## 6 Slc2a4 GSM2545… 782 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 20528 solute…
## 7 Exd2 GSM2545… 1619 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 97827 exonuc…
## 8 Gjc2 GSM2545… 288 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 118454 gap ju…
## 9 Plp1 GSM2545… 43217 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 18823 proteo…
## 10 Gnb4 GSM2545… 1071 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 14696 guanin…
## # ℹ 32,418 more rows
## # ℹ 6 more variables: ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
## # gene_biotype <chr>, phenotype_description <chr>, hsapiens_homolog_associated_gene_name <chr>
Now we have three groups based on the time
column.
Once we have grouped our data, subsequent operations will be carried out on a per-group basis. This is facilitated by the summarise()
function. Let’s try to find the mean expression across all genes at each time point.
%>%
rna_tbl group_by(time) %>%
summarise(mean_expression = mean(expression, na.rm = TRUE))
## # A tibble: 3 × 2
## time mean_expression
## <dbl> <dbl>
## 1 0 1888.
## 2 4 1872.
## 3 8 1820.
We can also group by more than one factor at once. For example, if we wanted the mean expression value for males and females at each timepoint.
%>%
rna_tbl group_by(sex, time) %>%
summarise(mean_expression = mean(expression, na.rm = TRUE))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 6 × 3
## # Groups: sex [2]
## sex time mean_expression
## <chr> <dbl> <dbl>
## 1 Female 0 1853.
## 2 Female 4 1892.
## 3 Female 8 1901.
## 4 Male 0 1934.
## 5 Male 4 1852.
## 6 Male 8 1712.
Similarly, once the data is grouped, we can also summarise several variables at the same time.
%>%
rna_tbl group_by(sex, time) %>%
summarise(mean_expression = mean(expression, na.rm = TRUE),
med_expression = median(expression, na.rm = TRUE))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups: sex [2]
## sex time mean_expression med_expression
## <chr> <dbl> <dbl> <dbl>
## 1 Female 0 1853. 528.
## 2 Female 4 1892. 568
## 3 Female 8 1901. 557
## 4 Male 0 1934. 573
## 5 Male 4 1852. 535
## 6 Male 8 1712. 496.
Challenge: Using the tidyverse
for data manipulation and analysis
Use the dplyr()
functions that we have discussed to answer the following questions:
%>%
rna_tbl group_by(gene_biotype) %>%
summarise(number = length(time))
## # A tibble: 13 × 2
## gene_biotype number
## <chr> <int>
## 1 IG_C_gene 22
## 2 TEC 88
## 3 lncRNA 1518
## 4 miRNA 154
## 5 polymorphic_pseudogene 44
## 6 processed_pseudogene 1298
## 7 protein_coding 29062
## 8 scaRNA 22
## 9 snoRNA 110
## 10 transcribed_processed_pseudogene 22
## 11 transcribed_unitary_pseudogene 22
## 12 transcribed_unprocessed_pseudogene 22
## 13 unprocessed_pseudogene 44
## Can be done easily with count()
%>%
rna_tbl count(gene_biotype)
## # A tibble: 13 × 2
## gene_biotype n
## <chr> <int>
## 1 IG_C_gene 22
## 2 TEC 88
## 3 lncRNA 1518
## 4 miRNA 154
## 5 polymorphic_pseudogene 44
## 6 processed_pseudogene 1298
## 7 protein_coding 29062
## 8 scaRNA 22
## 9 snoRNA 110
## 10 transcribed_processed_pseudogene 22
## 11 transcribed_unitary_pseudogene 22
## 12 transcribed_unprocessed_pseudogene 22
## 13 unprocessed_pseudogene 44
%>%
rna_tbl filter(gene_biotype == "lncRNA" | gene_biotype == "miRNA") %>%
group_by(sex, gene_biotype) %>%
summarise(mean_exp = mean(expression, na.rm = TRUE))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 3
## # Groups: sex [2]
## sex gene_biotype mean_exp
## <chr> <chr> <dbl>
## 1 Female lncRNA 800.
## 2 Female miRNA 9.02
## 3 Male lncRNA 190.
## 4 Male miRNA 11.4
%>%
rna_tbl filter(phenotype_description == "abnormal DNA methylation") %>%
group_by(gene, time) %>%
summarise(mean_exp = mean(expression, na.rm = TRUE)) %>%
mutate(log_mean_exp = log(mean_exp))
## `summarise()` has grouped output by 'gene'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups: gene [2]
## gene time mean_exp log_mean_exp
## <chr> <dbl> <dbl> <dbl>
## 1 Xist 0 24164 10.1
## 2 Xist 4 20184. 9.91
## 3 Xist 8 24609. 10.1
## 4 Zdbf2 0 534. 6.28
## 5 Zdbf2 4 570. 6.35
## 6 Zdbf2 8 496. 6.21