Saturday, 19 November 2022

VOLCANO PLOTS CUSTOMIZATION IN R

 

Volcano plots became famous with big data analysis. Whether you are working with NGS, mass spectrometry data or similar, you might need to build this kind of plot.



The 'ggplot2' package is very useful for this purpose. I found a very nice post where instructions to build a basic volcano plot are provided, just have a look. I will start from there and add some lines of code to customize the plot, including how adding the lines of significance and highlighting specific genes of interest.

It’s gonna be a bit long guiding you step-by-step, but I hope you’ll find it useful!


Let's start with the packages:

library(dplyr)
library(ggplot2)
library (ggrepel)
library(dslabs)


From the post I mentioned, I’ll use this function to have a nicer output for showing data (it is slightly modified compared to the original):

knitr_table <- function(x) {
                   x %>%
                   knitr::kable(format = "rst", digits = Inf,
                   format.args = list(big.mark = ","))
                            }


… and also the source of data:

data <- read_tsv("https://raw.githubusercontent.com/sdgamboa/misc_datasets/master/L0_vs_L20.tsv")

dim(data) #it shows the data dimensions (rows, columns)

head(data) %>%
    knitr_table() #to show the first 6 lines





These are analysed data from RNA sequencing, so you can find the variables of interest for building a volcano plot:

Genes: it contains the names of each gene

logFC: it contains the logarithm base 2 (log2) of the fold change of each gene

PValue: it contains p-values for each comparison

FDR: false discovery rate


We will use the FDR column, whose values need to be converted to their negative logarithm base 10 [log10(FDR)]. We’ll do it later, in the ‘ggplot()’ function.



DIFFERENTIALLY EXPRESSED GENES (DEGs)


Now that we got the data, it is necessary to add a column with the indication about the differentially expressed genes (DEGs). Usually, they are those with an FDR equal to or lower than 0.05, and a fold change equal to or higher than 2.

Therefore, we’ll add the variable “Expression” where classifying the genes as up-/down-regulated or unchanged according to the FDR and fold change:

data$Expression <- 
        ifelse((data$FDR <= 0.05 & data$logFC <= -1), "Down-regulated", 
                ifelse((data$FDR <= 0.05 & data$logFC >= 1), "Up-regulated",
                "Unchanged"))               

Where 1 is the log2(2).

Now, it is possible to count how many genes are up-/down-regulated or unchanged:

data %>%
    count(Expression)
    knitr_table()



… as well as to get a data frame with up-regulated genes and down-regulated genes, in descending order based on logFC. It can be exported if needed:

up <- data %>%
           filter(Expression == "Up-regulated") %>%
           arrange(desc(logFC))

down <- data %>%
           filter(Expression == "Down-regulated") %>%
           arrange(desc(abs(logFC)))

head(up) %>%
    knitr_table()

library(rio)
export(up, "up-regulated genes.xlsx") #to export the table to an excel file in the working directory

You can do the same with down-regulated genes.


BUILDING THE BASIC VOLCANO PLOT


Now we will build the basic volcano plot, remembering to convert the FDR to its log base 10 and assigning the colour to “Expression”:

p1 <- ggplot(data, aes(logFC, -log(FDR,10))) +
      geom_point(aes(color = Expression), size = 1.8, shape = 19, alpha = 0.3) +
      xlab(expression("log"[2]*"FC")) +
      ylab(expression("-log"[10]*"FDR")) + 
      scale_color_manual(values = c("deeppink1","orangered", "gray50")) +
      theme_bw()

windows(width=10, height=7); par(lwd=1, las=1, family="sans", cex=1,
  mgp=c(3.0,1,0))

p1




ADDING THE LINES FOR SIGNIFICANCE


We first have to set the values for the lines on the x and y axes:

y <- -log(0.05,10)  #this is for the FDR, the log base 10 of 0.05

x <- log(2, 2)  #this is for the fold change, the log base 2 of 2

p2 <- p1 +

      geom_hline(yintercept = y, col = "black", linetype = "dotdash") + #add an horizontal line at point y
      geom_vline(xintercept = c(x, -x), col = "black", linetype = "dotdash") #add a vertical line at point                                                                                        x and -x

windows(width=10, height=7); par(lwd=1, las=1, family="sans", cex=1,
  mgp=c(3.0,1,0))

p2



It calls to mind Japanese cherry blossom, doesn't it?

As a final step, you might want to highlight some genes of interest and add the labels.  



SELECTED GENES


Let's select those genes with 'filter()' in the 'dlpyr' package: 

labels <- data %>%
          filter(Genes == "57_45" | Genes == "1092_3" | Genes == "144_7" |
                 Genes == "468_3" | Genes == "29_188" | Genes == "1192_1")

Now we'll over-write them in the plot we built: 

windows(width=10, height=7); par(lwd=1, las=1, family="sans", cex=1,
  mgp=c(3.0,1,0))

p2 +
  geom_point(data = labels,
             mapping = aes(logFC, -log(FDR,10)), size = 1.8) + #this is for the points
  geom_label_repel(data = labels,
                   mapping = aes(logFC, -log(FDR,10), label = Genes), size = 2) #this is for the labels





Basically, inside the ‘geom_point()’ or the ‘geom_label_repel()’ the data you specify are the ones included in the previously prepared subset ‘labels’ and the mapping corresponds to the x and y of each point.

The ‘geom_label_repel()’ function, will move slightly the labels to avoid their overlapping. You can change the colour, size and shape of the highlighted points with the specific arguments in ’geom_point()’.



TOP GENES

To highlight the top genes we can use the ordered up-regulated and down-regulated genes we previously saved in the objects “up” and “down”.

We’ll select only the top 10 genes for up-regulated and down-regulated ones, and put them together in a data frame:

top_genes <- bind_rows(up %>% head(n=10),
                       down %>% head(n=10))

top_genes %>%
    knitr_table()

Now we can use this data frame similarly to the previous example: 

windows(width=10, height=7); par(lwd=1, las=1, family="sans", cex=1,
  mgp=c(3.0,1,0))

p2 +
  geom_point(data = top_genes,
             mapping = aes(logFC, -log(FDR,10)), size = 1.8) + 
  geom_label_repel(data = top_genes,
             mapping = aes(logFC, -log(FDR,10), label = Genes), size = 2)


 



SINGLE GENE


Finally, it is possible to highlight a single gene in this way:

windows(width=10, height=7); par(lwd=1, las=1, family="sans", cex=1,
  mgp=c(3.0,1,0))

p2 +
  geom_point(data = subset(data, Genes == "144_7"),
             mapping = aes(logFC, -log(FDR,10)), size = 1.8) +
  geom_label_repel(data = subset(data, Genes == "144_7"),
                   mapping = aes(logFC, -log(FDR,10), label = Genes), size = 2)




I hope this post will be useful as a kick-start. You can further customize the plot according to your needs, you have plenty of options. 
Have fun!

@colour credits: Peruzzella&Cinderella®