Demonstration of limmaDE2

Overview

The R package limmaDE2 contains a set of functions designed to make running differential expression analyses in LIMMA more user friendly. There are several functions of this package:

pipeLIMMA Run analyses of differnetial expression in LIMMA.
makeBinarySig Find and count the number of significantly differentially expressed genes
pqHists Plot the distribution of P and FDR-corrected P (Q) values
voom2PCA Conduct principal component analyses on voom (or otherwise) normalized expression matrices
volcanoPlot Generate log2 foldchange vs. P-value volcano plots
volcanoPair Compare two sets of log2 fold changes, colored by p-values
pipeDESeq2 Runs an analysis of differential expression similar to that of pipeLIMMA, except through the DESeq2 package.

1. Installation

The limmaDE2 packages is not on CRAN and can only be installed from github. Make sure the devtools package is installed on your system. The limmaDE2 package also requires several other R packages to be installed. These packages have functions upon which limmaDE2 depends:

Essential Packages:

limma contains all of the basic statistical elements that are called here.
qvalue fdr-corrected p-values
edgeR Differential expression package used for some transformations

Data for this vignette comes from:

SimSeq permits simulation of RNA-seq data

Other useful packages:

plyr summarize and manipulate lists and dataframes
vegan some distance-based methods
Heatplus heatmap plotting
topGO gene ontology (GO) analysis
ggplot2 routines for complex plots
hexbin methods to simplify data in plots

To install limmaDE2:

library(devtools)
install_github("jtlovell/limmaDE2")

Load all packages that will be needed for the tutorial:

library("limmaDE2")
library("ggplot2")
library("reshape2")
library("SimSeq")
library("DESeq2")

2. Importing data and checking

For limmaDE2 to work, two matrices are needed:

counts: Raw transcript abundance counts
info: Experimental design information These two matrices must match exactly, where each row in the info matrix corresponds to each column in counts. For best performance, the name of each gene should be stored in the rownames of the counts matrix.

data(kidney) # from simseq
counts<-kidney$counts
counts<-counts[sample(1:nrow(counts),1000),]
info<-with(kidney, 
           data.frame(id = paste(replic, treatment, sep = "_"), 
                      rep=replic, 
                      Treatment=ifelse(treatment == "Tumor","tumor","cntr"), 
                      stringsAsFactors=F))
colnames(counts)<-info$id

Add in another category that gives us a factorial design

group.a<-c("4619", "4712", "4863", "4865", "5452", "5453", "5454", "5455",
          "5456", "5457","5458", "5461", "5462", "5463", "5465", "5467",
          "5468", "5469", "5470", "5549","5552", "5580", "5641", "5672",
          "5689", "5701", "5703", "5706", "5989", "6088")
info$group<-ifelse(info$rep %in% group.a, "a","b")
with(info, table(group, Treatment))

##      Treatment
## group cntr tumor
##     a   30    30
##     b   42    42

info$trt.grp<-with(info, paste(Treatment, group, sep="_"))

The top of the example experimental design dataset

head(info)

##               id  rep Treatment group trt.grp
## 1 3358_Non-Tumor 3358      cntr     b  cntr_b
## 2     3358_Tumor 3358     tumor     b tumor_b
## 3 3387_Non-Tumor 3387      cntr     b  cntr_b
## 4     3387_Tumor 3387     tumor     b tumor_b
## 5 4619_Non-Tumor 4619      cntr     a  cntr_a
## 6     4619_Tumor 4619     tumor     a tumor_a

What the example counts dataset looks like

counts[1:10,1:3]

##               3358_Non-Tumor 3358_Tumor 3387_Non-Tumor
## GSK3B|2932              2096       3973           2310
## TEC|7006                  88        246            125
## RTBDN|83546                0          0              0
## SNX13|23161             3139       4764           2798
## MAST1|22983                7          1              0
## TFAM|7019               1436       1697           1550
## UBXN10|127733            401       1581            260
## KRT14|3861                 0         28              2
## CINP|51550               993       1401           1104
## TACR1|6869                80        245             49

In all statistical analyses, it is important to set the levels of the experimental factors. This is esspecially true in linear modelling, such as limma, where levels are tested against the base level. Here, we set the “Non-Tumor” treatment and “a” group as the base.

info$Treatment <- factor(info$Treatment,
                           levels = c("cntr", "tumor"))
info$group <- factor(info$group,
                           levels = c("a", "b"))

3. Basic analysis of differential expression

To test the plasticity (differential expression) of all genes between the tumor and control tissue:

stats <- pipeLIMMA(counts = counts, 
                   info = info, 
                   block = NULL, 
                   formula = "~ Treatment")

## calculating normalization factors ... 
## running voom normalization correcting for quality weights ... 
## fitting linear model ... 
## processing statistics and calculating q-values ...

lmStats<-stats$stats
voom<-stats$voom$E

pipeLIMMA returns three elements: stats, voom and fstats. These are the statistical output of the limma functions: eBayes, voom and topTableF. Inspect the top few rows and columns of each.

`stats`: eBayes linear model statistics

	gene	sigma	s2.post	Amean	Fstat	Fpvalue
GSK3B\|2932	GSK3B\|2932	0.8001060	0.6416126	10.281119	17.4070785	0.0000518
TEC\|7006	TEC\|7006	0.4726687	0.2330547	5.935553	47.5471968	0.0000000
RTBDN\|83546	RTBDN\|83546	1.2190479	1.4708851	-1.752828	0.1332056	0.7156636
SNX13\|23161	SNX13\|23161	0.7140688	0.5138993	10.420350	23.7538159	0.0000028
MAST1\|22983	MAST1\|22983	1.2083085	1.4453294	1.163619	5.4418431	0.0210366
TFAM\|7019	TFAM\|7019	0.6495017	0.4275889	9.127099	166.6738831	0.0000000

`voom`: normalized expression matrix

	3358_Non-Tumor	3358_Tumor	3387_Non-Tumor	3387_Tumor	4619_Non-Tumor	4619_Tumor
GSK3B\|2932	10.135067	10.202899	10.235435	10.402782	9.9837220	10.808928
TEC\|7006	5.568906	6.192148	6.032989	5.679887	5.2827638	6.604529
RTBDN\|83546	-1.898700	-2.753296	-1.938555	-2.690800	-1.4587032	-2.126790
SNX13\|23161	10.717619	10.464813	10.511883	10.409041	10.4515648	12.185029
MAST1\|22983	2.008191	-1.168334	-1.938555	1.557127	0.1262593	0.680565
TFAM\|7019	9.589642	8.975900	9.659963	8.963388	9.6426160	9.449222

`fstats`: F-statistics for each factor in the model, in this case, just treatment

gene	Treatment_F	Treatment_P.Value	Treatment_Q.Value
AATF\|26574	128.433398	0.0000000	0.0000000
ABCB8\|11194	69.822932	0.0000000	0.0000000
ABCC6P1\|653190	2.194034	0.1407181	0.1820415
ABCC6P2\|730013	1.154329	0.2844313	0.3502849
ACAD11\|84129	49.899739	0.0000000	0.0000000
ACCN2\|41	96.479576	0.0000000	0.0000000

4. Differential expression using a blocking variable

Lets say that here, there is some sort of experimental blocking. In this case, we employ a routine in limma that calculates the `duplicateCorrelation` among blocks, then uses the blocking variable in the linear model fit.

info$block <- rep(1:2,each=nrow(info)/2)
stats.block <- pipeLIMMA(counts = counts, 
                   info = info, 
                   block = info$block,
                   formula = "~ Treatment")

## calculating normalization factors ... 
## running voom normalization correcting for quality weights ... 
## calculating duplicate correlation among replicates ... 
## fitting linear model ... 
## processing statistics and calculating q-values ...

5. Differential expression in a factorial experiment

stats.factorial <- pipeLIMMA(counts = counts, 
                   info = info, 
                   block = NULL, 
                   formula = "~ Treatment + group + Treatment*group")

## calculating normalization factors ... 
## running voom normalization correcting for quality weights ... 
## fitting linear model ... 
## processing statistics and calculating q-values ...

6. Calculating differential expression using specific contrasts

Sometimes, it may make more sense to use specific contrasts to test for differential expression. For example, let’s say we are interested in how the tumors and non-tumor tissues differ for each of the two groups.

First, we need to construct a design matrix.

design <- with(info, model.matrix(~ 0 + trt.grp))
colnames(design)<-gsub("trt.grp","",colnames(design))
head(design)

##   cntr_a cntr_b tumor_a tumor_b
## 1      0      1       0       0
## 2      0      0       0       1
## 3      0      1       0       0
## 4      0      0       0       1
## 5      1      0       0       0
## 6      0      0       1       0

Then we construct a contrast matrix.

contrast.matrix <- makeContrasts(
  tumor_a - cntr_a , 
  tumor_b - cntr_b,
  levels = design)
head(contrast.matrix)

##          Contrasts
## Levels    tumor_a - cntr_a tumor_b - cntr_b
##   cntr_a                -1                0
##   cntr_b                 0               -1
##   tumor_a                1                0
##   tumor_b                0                1

Finally, we fit the model with a contrast matrix as the design, overriding the formula argument.

stats <- pipeLIMMA(counts = counts, 
                   info = info, 
                   block = NULL, 
                   design = design, 
                   contrast.matrix = contrast.matrix)

## calculating normalization factors ... 
## running voom normalization correcting for quality weights ... 
## fitting model to contrast matrix ... 
## processing statistics and calculating q-values ...

stats.contrasts <- stats$stats

6. Count and plot the number of significantly differentially expressed genes

The function makeBinarySig looks for a provided string (e.g. “Q.Value”) and outputs a matrix with whether or not those columns have values <= the provided alpha

sigs <- makeBinarySig(stats.contrasts, what = "Qvalue", alpha = 0.05)

## ebayesQvalue_tumor_a...cntr_a    627 
## ebayesQvalue_tumor_b...cntr_b    665

Make a venn diagram of number of differentially expressed genes among the experimental factors

counts2Venn(x=sigs, cols=c(1,2), names=c("in.grpA","in.grpB"),
   colors=c("blue","red"),type="limma", legx=-3.3,legy=-3)

Make a euler diagram of number of differentially expressed genes among the experimental factors

counts2Venn(x=sigs, cols=c(1,2), names=c("in.grpA","in.grpB"),
   colors=c("blue","red"),type="Euler", legx=-3.3,legy=-3)

7. Make a volcano plot of the results

Volcano plots are a good way to look at the differences between two experimental levels. Here, we compare the extent of differential expression between the “high” treatment to the “low” treatment.

with(lmStats, volcanoPlot(pval = ebayesPvalue_Treatmenttumor,
                          lfc = Treatmenttumor_logFC,
                          sig = ebayesQvalue_Treatmenttumor,
                          main = "no tumor vs. tumor Volcano Plot", 
                          xlab = "tumor - no tumor Log2 Fold Change",
                          bty = "n", legpos = "top", leginset = c(0,-.1)))

##       sig
## updown FALSE TRUE
##   down   130  385
##   up     168  317

8. Using two contrasts, make a pairwise volcano plot

Custom colors

sigs <- data.frame(makeBinarySig(stats.contrasts, what = "Qvalue", alpha = 0.05))

## ebayesQvalue_tumor_a...cntr_a    627 
## ebayesQvalue_tumor_b...cntr_b    665

names(sigs)<-c("sig.a","sig.b")
sigs$sign.a<-sign(stats.contrasts$tumor_a...cntr_a_logFC)
sigs$sign.b<-sign(stats.contrasts$tumor_b...cntr_b_logFC)

cols<-with(sigs, ifelse(sig.a + sig.b == 0,  
                        "grey",
                        ifelse(sig.a + sig.b == 2 & sign.a*sign.b == 1, 
                               "pink",
                               ifelse(sig.a + sig.b == 2 & sign.a*sign.b == -1,
                                      "cornflowerblue",
                                      ifelse(sig.a == 1, "darkblue", "darkred")))))

Make the plot

with(stats.contrasts, volcanoPair(lfc1 = tumor_a...cntr_a_logFC,
                                  lfc2 = tumor_b...cntr_b_logFC,
                                  pt.col = cols, pt.pch = 19, pt.cex=.5,
                                  xlab = "Tumor - control LFC (group A)",
                                  ylab = "Tumor - control LFC (group B)"))

## pt.col
## cornflowerblue       darkblue        darkred           grey           pink 
##             28            123            161            212            476

Principal component analysis of gene expression

It is often a good idea to get an idea of how the individuals (libraries) are structured - how similar are they to eachother. Principal component analyses allow for this kind of inference.

pca12 <- counts2PCA(counts=voom, info = info, ids = info$id, pcas2return = 6)

pca12.var <- pca12[[2]]
pca12 <- pca12[[1]]
gcols <- c("darkblue", "blue", "gold", "green", "pink", "red")
ggplot(pca12, aes(x = PC1, y = PC2, col = group, shape = Treatment)) +
  geom_vline(xintercept = 0, lty = 2)+
  geom_hline(yintercept = 0, lty = 2)+
  geom_point() +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.background = element_blank()) +
  labs(x = paste("PCA1 (", pca12.var[1],"%)", sep = ""),
       y = paste("PCA2 (", pca12.var[2],"%)", sep = ""),
       title = "PCA analysis of voom-normalized expression")

For more information, visit lovelleeb.weebly.com/analytics

Demonstration of limmaDE2

JT Lovell

2016-11-04

Overview

The R package limmaDE2 contains a set of functions designed to make running differential expression analyses in LIMMA more user friendly. There are several functions of this package:

1. Installation

Essential Packages:

Data for this vignette comes from:

Other useful packages:

To install limmaDE2:

Load all packages that will be needed for the tutorial:

2. Importing data and checking

For limmaDE2 to work, two matrices are needed:

Add in another category that gives us a factorial design

The top of the example experimental design dataset

What the example counts dataset looks like

3. Basic analysis of differential expression

To test the plasticity (differential expression) of all genes between the tumor and control tissue:

`stats`: eBayes linear model statistics

`voom`: normalized expression matrix

`fstats`: F-statistics for each factor in the model, in this case, just treatment

4. Differential expression using a blocking variable

Lets say that here, there is some sort of experimental blocking. In this case, we employ a routine in limma that calculates the `duplicateCorrelation` among blocks, then uses the blocking variable in the linear model fit.

5. Differential expression in a factorial experiment

6. Calculating differential expression using specific contrasts

First, we need to construct a design matrix.

Then we construct a contrast matrix.

Finally, we fit the model with a contrast matrix as the design, overriding the formula argument.

6. Count and plot the number of significantly differentially expressed genes

Make a venn diagram of number of differentially expressed genes among the experimental factors

Make a euler diagram of number of differentially expressed genes among the experimental factors

7. Make a volcano plot of the results

8. Using two contrasts, make a pairwise volcano plot

Custom colors

Make the plot

Principal component analysis of gene expression

Demonstration of limmaDE2

JT Lovell

2016-11-04

Overview

The R package limmaDE2 contains a set of functions designed to make running differential expression analyses in LIMMA more user friendly. There are several functions of this package:

1. Installation

Essential Packages:

Data for this vignette comes from:

Other useful packages:

To install limmaDE2:

Load all packages that will be needed for the tutorial:

2. Importing data and checking

For limmaDE2 to work, two matrices are needed:

Add in another category that gives us a factorial design

The top of the example experimental design dataset

What the example counts dataset looks like

3. Basic analysis of differential expression

To test the plasticity (differential expression) of all genes between the tumor and control tissue:

stats: eBayes linear model statistics

voom: normalized expression matrix

fstats: F-statistics for each factor in the model, in this case, just treatment

4. Differential expression using a blocking variable

Lets say that here, there is some sort of experimental blocking. In this case, we employ a routine in limma that calculates the duplicateCorrelation among blocks, then uses the blocking variable in the linear model fit.

5. Differential expression in a factorial experiment

6. Calculating differential expression using specific contrasts

First, we need to construct a design matrix.

Then we construct a contrast matrix.

Finally, we fit the model with a contrast matrix as the design, overriding the formula argument.

6. Count and plot the number of significantly differentially expressed genes

Make a venn diagram of number of differentially expressed genes among the experimental factors

Make a euler diagram of number of differentially expressed genes among the experimental factors

7. Make a volcano plot of the results

8. Using two contrasts, make a pairwise volcano plot

Custom colors

Make the plot

Principal component analysis of gene expression

`stats`: eBayes linear model statistics

`voom`: normalized expression matrix

`fstats`: F-statistics for each factor in the model, in this case, just treatment

Lets say that here, there is some sort of experimental blocking. In this case, we employ a routine in limma that calculates the `duplicateCorrelation` among blocks, then uses the blocking variable in the linear model fit.