Here, we illustrate how to use the iasva package to detect cell cycle stage difference within single cell RNA sequencing data. We use single cell RNA sequencing (scRNA-Seq) data obtained from human glioblastoma samples (Petel et. al., 2014). This dataset is included in a R data package (“iasvaExamples”) containing data examples for IA-SVA (https://github.com/dleelab/iasvaExamples). To install the package, follow the instruction provided in the GitHub page.
#devtools
library(devtools)
#iasva
devtools::install_github("UcarLab/IA-SVA")
#iasvaExamples
devtools::install_github("dleelab/iasvaExamples")
rm(list=ls())
library(irlba) # partial SVD, the augmented implicitly restarted Lanczos bidiagonalization algorithm
library(iasva)
library(iasvaExamples)
library(sva)
library(Rtsne)
library(pheatmap)
library(corrplot)
library(DescTools) #pcc i.e., Pearson's contingency coefficient
library(RColorBrewer)
color.vec <- brewer.pal(3, "Set1")
data("Patel_Glioblastoma_scRNAseq_Read_Counts")
data("Patel_Glioblastoma_scRNAseq_Annotations")
ls()
## [1] "color.vec"
## [2] "Patel_Glioblastoma_scRNAseq_Annotations"
## [3] "Patel_Glioblastoma_scRNAseq_Read_Counts"
counts <- Patel_Glioblastoma_scRNAseq_Read_Counts
anns <- Patel_Glioblastoma_scRNAseq_Annotations
dim(anns)
## [1] 434 3
dim(counts)
## [1] 25415 434
summary(anns)
## run patient_id subtype
## SRR1294493: 1 MGH26:118 None :120
## SRR1294494: 1 MGH28: 95 Mes :103
## SRR1294496: 1 MGH29: 76 Pro : 89
## SRR1294498: 1 MGH30: 74 Cla : 46
## SRR1294499: 1 MGH31: 71 Neu : 24
## SRR1294500: 1 Pro+Cla: 20
## (Other) :428 (Other): 32
table(anns$patient_id, anns$subtype)
##
## Cla Cla+Mes Mes Neu Neu+Cla Neu+Mes None Pro Pro+Cla Pro+Neu
## MGH26 10 0 0 1 1 0 19 71 14 2
## MGH28 1 5 56 0 0 7 21 5 0 0
## MGH29 0 0 28 12 0 12 19 4 0 1
## MGH30 33 1 8 1 1 0 16 6 6 2
## MGH31 2 0 11 10 0 0 45 3 0 0
ContCoef(table(anns$patient_id, anns$subtype))
## [1] 0.723431
The annotations describing the glioblastoma samples and experimental settings are stored in “anns” and the read counts information is stored in “counts”.
We use read counts of glioblastoma cells from Patient MGH30 (n = 58).
counts <- counts[, (anns$subtype!="None")&(anns$patient_id=="MGH30")]
anns <- subset(anns, (subtype!="None")&(patient_id=="MGH30"))
dim(counts)
[1] 25415 58
dim(anns)
[1] 58 3
anns <- droplevels(anns)
prop.zeros <- sum(counts==0)/length(counts)
prop.zeros
[1] 0.6279118
# filter out genes that are sparsely and lowly expressed
filter = apply(counts, 1, function(x) length(x[x>5])>=3)
counts = counts[filter,]
dim(counts)
[1] 21151 58
prop.zeros <- sum(counts==0)/length(counts)
prop.zeros
[1] 0.5566493
Subtype <- anns$subtype
Patient_ID <- anns$patient_id
It is well known that the geometric library size (i.e., library size of log-transfromed read counts) or proportion of expressed genes in each cell explains a very large portion of variability of scRNA-Seq data (Hicks et. al. 2015 BioRxiv, McDavid et. al. 2016 Nature Biotechnology). Frequently, the first principal component of log-transformed scRNA-Seq read counts is highly correlated with the geometric library size (r ~ 0.9). Here, we calculate the geometric library size vector, which will be used as a known factor in the IA-SVA algorithm.
Geo_Lib_Size <- colSums(log(counts+1))
barplot(Geo_Lib_Size, xlab="Cell", ylab="Geometric Lib Size", las=2)
lcounts <- log(counts + 1)
pc1 = irlba(lcounts - rowMeans(lcounts), 1)$v[,1] ## partial SVD
cor(Geo_Lib_Size, pc1)
## [1] 0.9595776
Here, we run IA-SVA using Geo_Lib_Size as a known factor and identify five hidden factors. SVs are plotted in a pairwise fashion to uncover which SVs can seperate cells.
set.seed(345)
mod <- model.matrix(~Geo_Lib_Size)
iasva.res<- iasva(t(counts), mod[,-1],verbose=FALSE, permute=FALSE, num.sv=5) ## irlba
## IA-SVA running...
## SV1 Detected!
## SV2 Detected!
## SV3 Detected!
## SV4 Detected!
## SV5 Detected!
## # of significant surrogate variables: 5
iasva.sv <- iasva.res$sv
Cell_Cycle <- as.factor(iasva.sv[,2] > -0.1)
levels(Cell_Cycle)=c("Cycle1","Cycle2")
table(Cell_Cycle)
## Cell_Cycle
## Cycle1 Cycle2
## 12 46
pairs(iasva.sv[,1:5], main="IA-SVA", pch=21, col=color.vec[Cell_Cycle], bg=color.vec[Cell_Cycle], oma=c(4,4,6,14))
legend("right", levels(Cell_Cycle), fill=color.vec, bty="n")
plot(iasva.sv[,1:2], main="IA-SVA", pch=21, xlab="SV1", ylab="SV2", col=color.vec[Cell_Cycle], bg=color.vec[Cell_Cycle])
#legend("bottomright", levels(Cell_Cycle), fill=color.vec, bty="n")
cor(Geo_Lib_Size, iasva.sv[,2])
## [1] -0.4870356
corrplot(cor(iasva.sv))
As shown in the above figure, SV2 clearly separates glioblastoma cells into two groups: 12 cells (marked in red) and the rest of the cells (marked in blue). Note that SV2 is moderately correlated with the geometric library size (|r|=0.49). SV5 also captures an outlier cell. However, we will focus on SV2 in the rest of the analyses.
Here, using the find.markers() function we find marker genes (n=87 genes) that are significantly associated with SV2 (multiple testing adjusted p-value < 0.05, default significance cutoff, and R-squared value > 0.4).
marker.counts <- find.markers(t(counts), as.matrix(iasva.sv[,2]), rsq.cutoff = 0.4)
## # of markers (): 87
## total # of unique markers: 87
nrow(marker.counts) #87 58
## [1] 87
rownames(marker.counts)
## [1] "NCAPD2" "TACC3" "SPAG5" "UBE2T"
## [5] "NDC80" "RAD54L" "TPX2" "BIRC5"
## [9] "KIF4A" "ORC6" "CLSPN" "CDC7"
## [13] "CENPM" "ASF1B" "NCAPG" "FOXM1"
## [17] "TIMELESS" "CDCA3" "ECT2" "CENPF"
## [21] "KIF14" "NCAPH" "MND1" "KIF18A"
## [25] "ZWINT" "HJURP" "DLGAP5" "FAM64A"
## [29] "SGOL1" "TOP2A" "CCNB1" "CDCA8"
## [33] "TROAP" "TCF19" "NRM" "NUSAP1"
## [37] "KIF23" "CASC5" "KIF20B" "CENPE"
## [41] "KIF2C" "NUF2" "FANCD2" "TIFA"
## [45] "CDCA5" "NCAPG2" "MKI67" "SPC25"
## [49] "SKA1" "EME1" "BUB1B" "CCNB2"
## [53] "CDC25C" "RACGAP1" "SPC24" "KIF15"
## [57] "POC1A" "MAD2L1" "PTTG1" "MELK"
## [61] "C11orf82" "CENPN" "TK1" "PBK"
## [65] "CKAP2L" "BUB1" "CDK1" "SHCBP1"
## [69] "ESCO2" "RRM2" "UBE2C" "TYMS"
## [73] "AURKB" "TRAIP" "IQGAP3" "PARPBP"
## [77] "XRCC2" "HMGN2" "PRC1" "KIF4B"
## [81] "PTTG2" "RP11-192H23.4" "RP11-133K1.2" "RP11-798K3.2"
## [85] "RP11-143J24.1" "CTD-2510F5.6" "CTB-175P5.4"
anno.col <- data.frame(Subtype=Subtype, Cell_Cycle=Cell_Cycle, Lib_Size=colSums(counts))
rownames(anno.col) <- colnames(marker.counts)
head(anno.col)
## Subtype Cell_Cycle Lib_Size
## SRR1294928 Pro+Cla Cycle2 63371814
## SRR1294930 Pro Cycle2 91372430
## SRR1294931 Cla Cycle2 58507598
## SRR1294935 Cla Cycle2 137157795
## SRR1294936 Cla Cycle2 22486574
## SRR1294938 Cla Cycle2 67682923
pheatmap(log(marker.counts+1), show_colnames =FALSE, clustering_method = "ward.D2",cutree_cols = 2,annotation_col = anno.col)
Theses marker genes are strongly enriched in cell-cycle related Go terms and KEGG pathways. (See Supplementary Figure 6 in https://doi.org/10.1101/151217)
Here, we apply tSNE on the marker genes for SV2 obtained from IA-SVA
set.seed(34523)
tsne.res <- Rtsne(unique(t(log(marker.counts+1))), dims = 2, perplexity = 15)
plot(tsne.res$Y, main="tSNE post IA-SVA", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Cell_Cycle], bg=color.vec[Cell_Cycle], oma=c(4,4,6,12))
legend("bottomright", levels(Cell_Cycle), fill=color.vec, bty="n")
cor(Geo_Lib_Size, iasva.sv[,2])
## [1] -0.4870356
By allowing correlation between factors, IA-SVA accurately detects the cell cycle stage difference, which is moderately correlated (|r|=0.49) with the geometric library size (the first principal component). Existing methods fail to detect the heterogeneity due to the orthogonality assumption.
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 DescTools_0.99.19 corrplot_0.77
## [4] pheatmap_1.0.8 Rtsne_0.11 sva_3.22.0
## [7] genefilter_1.56.0 mgcv_1.8-16 nlme_3.1-128
## [10] iasvaExamples_0.1.0 iasva_0.95 irlba_2.2.1
## [13] Matrix_1.2-7.1 knitr_1.14 devtools_1.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 plyr_1.8.4 formatR_1.4
## [4] bitops_1.0-6 tools_3.3.2 boot_1.3-18
## [7] digest_0.6.12 manipulate_1.0.1 gtable_0.2.0
## [10] annotate_1.52.1 evaluate_0.10 memoise_1.0.0
## [13] RSQLite_1.1-2 lattice_0.20-34 DBI_0.6-1
## [16] yaml_2.1.14 parallel_3.3.2 expm_0.999-2
## [19] mvtnorm_1.0-6 withr_1.0.2 stringr_1.2.0
## [22] S4Vectors_0.12.2 IRanges_2.8.2 stats4_3.3.2
## [25] rprojroot_1.2 grid_3.3.2 Biobase_2.34.0
## [28] AnnotationDbi_1.36.2 XML_3.98-1.6 survival_2.40-1
## [31] foreign_0.8-67 rmarkdown_1.3 magrittr_1.5
## [34] MASS_7.3-45 scales_0.4.1 backports_1.0.4
## [37] htmltools_0.3.5 BiocGenerics_0.20.0 splines_3.3.2
## [40] colorspace_1.2-7 xtable_1.8-2 stringi_1.1.3
## [43] munsell_0.4.3 RCurl_1.95-4.8