The iasva package can be used to detect hidden heterogenity within bulk or single cell sequencing data. To illustrate how to use the iasva package for heterogenity detection, we use real-world single cell RNA sequencing (scRNA-Seq) data obtained from human pancreatic islet samples (Lawlor et. al., 2016). This dataset is included in a R data package (“iasvaExamples”) containing data examples for IA-SVA (https://github.com/dleelab/iasvaExamples). To install the package, follow the instruction provided in the GitHub page.
#devtools
library(devtools)
#iasva
devtools::install_github("UcarLab/IA-SVA")
#iasvaExamples
devtools::install_github("dleelab/iasvaExamples")
rm(list=ls())
library(irlba) # partial SVD, the augmented implicitly restarted Lanczos bidiagonalization algorithm
library(iasva)
library(iasvaExamples)
library(sva)
library(Rtsne)
library(pheatmap)
library(corrplot)
library(DescTools) #pcc i.e., Pearson's contingency coefficient
library(RColorBrewer)
color.vec <- brewer.pal(3, "Set1")
data("Lawlor_Islet_scRNAseq_Read_Counts")
data("Lawlor_Islet_scRNAseq_Annotations")
ls()
## [1] "color.vec" "Lawlor_Islet_scRNAseq_Annotations"
## [3] "Lawlor_Islet_scRNAseq_Read_Counts"
counts <- Lawlor_Islet_scRNAseq_Read_Counts
anns <- Lawlor_Islet_scRNAseq_Annotations
dim(anns)
## [1] 638 26
dim(counts)
## [1] 26616 638
summary(anns)
## run cell.type COL1A1 INS
## Length:638 Length:638 Min. :1.00 Min. :1.000
## Class :character Class :character 1st Qu.:1.00 1st Qu.:1.000
## Mode :character Mode :character Median :1.00 Median :1.000
## Mean :1.03 Mean :1.414
## 3rd Qu.:1.00 3rd Qu.:2.000
## Max. :2.00 Max. :2.000
##
## PRSS1 SST GCG KRT19
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.038 Mean :1.039 Mean :1.375 Mean :1.044
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
##
## PPY num.genes Cell_ID UNOS_ID
## Min. :1.000 Min. :3529 10th_C1_S59 : 1 ACCG268 :136
## 1st Qu.:1.000 1st Qu.:5270 10th_C10_S104: 1 ACJV399 :108
## Median :1.000 Median :6009 10th_C11_S96 : 1 ACEL337 :103
## Mean :1.028 Mean :5981 10th_C13_S61 : 1 ACIW009 : 93
## 3rd Qu.:1.000 3rd Qu.:6676 10th_C14_S53 : 1 ACCR015A: 57
## Max. :2.000 Max. :8451 10th_C16_S105: 1 ACIB065 : 57
## (Other) :632 (Other) : 84
## Age Biomaterial_Provider Gender Phenotype
## Min. :22.00 IIDP : 45 Female:303 Non-Diabetic :380
## 1st Qu.:29.00 Prodo Labs:593 Male :335 Type 2 Diabetic:258
## Median :42.00
## Mean :39.63
## 3rd Qu.:53.00
## Max. :56.00
##
## Race BMI Cell_Type Patient_ID
## African American:175 Min. :22.00 INS :264 P1 :136
## Hispanic :165 1st Qu.:26.60 GCG :239 P8 :108
## White :298 Median :32.95 KRT19 : 28 P3 :103
## Mean :32.85 SST : 25 P7 : 93
## 3rd Qu.:35.80 PRSS1 : 24 P5 : 57
## Max. :55.00 none : 21 P6 : 57
## (Other): 37 (Other): 84
## Sequencing_Run Batch Coverage Percent_Aligned
## 12t : 57 B1:193 Min. :1206135 Min. :0.8416
## 4th : 57 B2:148 1st Qu.:2431604 1st Qu.:0.8769
## 9th : 57 B3:297 Median :3042800 Median :0.8898
## 10t : 56 Mean :3160508 Mean :0.8933
## 7th : 55 3rd Qu.:3871697 3rd Qu.:0.9067
## 3rd : 53 Max. :5931638 Max. :0.9604
## (Other):303
## Mitochondrial_Fraction Num_Expressed_Genes
## Min. :0.003873 Min. :3529
## 1st Qu.:0.050238 1st Qu.:5270
## Median :0.091907 Median :6009
## Mean :0.108467 Mean :5981
## 3rd Qu.:0.142791 3rd Qu.:6676
## Max. :0.722345 Max. :8451
##
ContCoef(table(anns$Gender, anns$Cell_Type))
## [1] 0.225969
ContCoef(table(anns$Phenotype, anns$Cell_Type))
## [1] 0.1145096
ContCoef(table(anns$Race, anns$Cell_Type))
## [1] 0.3084146
ContCoef(table(anns$Patient_ID, anns$Cell_Type))
## [1] 0.5232058
ContCoef(table(anns$Batch, anns$Cell_Type))
## [1] 0.3295619
The annotations describing the islet samples and experimental settings are stored in “anns” and the read counts information is stored in “counts”.
To illustrate how IA-SVA can be used to detect hidden heterogeneity within a homogenous cell population (i.e., alpha cells), we use read counts of alpha cells from healthy (non-diabetic) subjects (n = 101).
counts <- counts[, (anns$Phenotype!="Non-Diabetic")&(anns$Cell_Type=="GCG")]
anns <- subset(anns, (Phenotype!="Non-Diabetic")&(Cell_Type=="GCG"))
dim(counts)
[1] 26616 101
dim(anns)
[1] 101 26
anns <- droplevels(anns)
prop.zeros <- sum(counts==0)/length(counts)
prop.zeros
[1] 0.6954073
# filter out genes that are sparsely and lowly expressed
filter = apply(counts, 1, function(x) length(x[x>5])>=3)
counts = counts[filter,]
dim(counts)
[1] 14416 101
prop.zeros <- sum(counts==0)/length(counts)
prop.zeros
[1] 0.4520953
It is well known that the geometric library size (i.e., library size of log-transfromed read counts) or proportion of expressed genes in each cell explains a very large portion of variability of scRNA-Seq data (Hicks et. al. 2015 BioRxiv, McDavid et. al. 2016 Nature Biotechnology). Frequently, the first principal component of log-transformed scRNA-Seq read counts is highly correlated with the geometric library size (r ~ 0.9). Here, we calculate the geometric library size vector, which will be used as a known factor in the IA-SVA algorithm.
Geo_Lib_Size <- colSums(log(counts+1))
barplot(Geo_Lib_Size, xlab="Cell", ylab="Geometric Lib Size", las=2)
lcounts <- log(counts + 1)
# PC1 and Geometric library size correlation
pc1 = irlba(lcounts - rowMeans(lcounts), 1)$v[,1] ## partial SVD
cor(Geo_Lib_Size, pc1)
## [1] 0.9966524
Here, we run IA-SVA using Patient_ID and Geo_Lib_Size as known factors and identify five hidden factors. SVs are plotted in a pairwise fashion to uncover which SVs can seperate cell types.
set.seed(454353)
Patient_ID <- anns$Patient_ID
mod <- model.matrix(~Patient_ID+Geo_Lib_Size)
iasva.res<- iasva(t(counts), mod[,-1],verbose=FALSE, permute=FALSE, num.sv=5) ##irlba
## IA-SVA running...
## SV1 Detected!
## SV2 Detected!
## SV3 Detected!
## SV4 Detected!
## SV5 Detected!
## # of significant surrogate variables: 5
iasva.sv <- iasva.res$sv
plot(iasva.sv[,1], iasva.sv[,2], xlab="SV1", ylab="SV2")
Cell_Type <- as.factor(iasva.sv[,2] > -0.2)
levels(Cell_Type)=c("Cell1","Cell2")
table(Cell_Type)
## Cell_Type
## Cell1 Cell2
## 6 95
# We identified 6 outlier cells based on SV2 that are marked in red
pairs(iasva.sv, main="IA-SVA", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], oma=c(4,4,6,12)) #4,4,6,12
legend("right", levels(Cell_Type), fill=color.vec, bty="n")
plot(iasva.sv[,1:2], main="IA-SVA", pch=21, xlab="SV1", ylab="SV2", col=color.vec[Cell_Type], bg=color.vec[Cell_Type])
cor(Geo_Lib_Size, iasva.sv[,2])
## [1] -0.1776038
corrplot(cor(iasva.sv))
As shown in the above figure, SV2 clearly separates alpha cells into two groups: 6 outlier cells (marked in red) and the rest of the alpha cells (marked in green). SV3 and SV4 also capture outlier cells. However, we will focus on SV2 in the rest of the analyses.
Here, using the find.markers() function we find marker genes (n=105 genes) that are significantly associated with SV2 (multiple testing adjusted p-value < 0.05, default significance cutoff, and R-squared value > 0.3, default R-squared cutoff).
marker.counts <- find.markers(t(counts), as.matrix(iasva.sv[,2]))
## # of markers (): 105
## total # of unique markers: 105
nrow(marker.counts)
## [1] 105
rownames(marker.counts)
## [1] "SMAD7" "PMEPA1" "FAM198B" "ANGPTL2"
## [5] "LINC00152" "WIPF1" "MYCT1" "FLT1"
## [9] "SYNM" "MIR4435-1HG" "RNF24" "ENG"
## [13] "RASSF2" "NFIB" "SLC1A5" "SOX4"
## [17] "ID3" "ITGA5" "TMEM233" "FMNL2"
## [21] "PXDN" "PRDM1" "C8orf4" "ERG"
## [25] "RFTN2" "ZNF503" "CLIC4" "RP11-160E2.19"
## [29] "DAB2" "JAG1" "LGALS9" "ARHGAP31"
## [33] "SNAI1" "TIMP3" "A2M" "DCHS1"
## [37] "PPAP2B" "DPYSL3" "UACA" "THBS1"
## [41] "HTRA1" "MEF2C" "SERPINB8" "CHST3"
## [45] "FSTL1" "CLIC2" "LGALS1" "IL32"
## [49] "S100A16" "CD93" "IGFBP4" "RBMS1"
## [53] "COL18A1" "LAMA4" "STC1" "STAB1"
## [57] "ACVRL1" "ELTD1" "COL4A1" "LPHN2"
## [61] "COL4A2" "MSN" "CTHRC1" "ELK3"
## [65] "EMP1" "TINAGL1" "TNFAIP2" "P2RY6"
## [69] "MCAM" "C1QTNF5" "MMP2" "HBEGF"
## [73] "SERPINE1" "SPARC" "SPARCL1" "RAPGEF5"
## [77] "ESAM" "KDR" "COL15A1" "RASSF3"
## [81] "REST" "ITPRIPL2" "ETS1" "GMFG"
## [85] "ANO7" "GNG11" "VWF" "HDAC7"
## [89] "CD9" "IFITM2" "IFITM3" "PTRF"
## [93] "CXCR4" "SERPING1" "PODXL" "NES"
## [97] "PLVAP" "CALD1" "LAMB2" "IL4R"
## [101] "MMP1" "IFI16" "RHOJ" "RBP5"
## [105] "ADAMTS4"
anno.col <- data.frame(Cell_Type=Cell_Type)
rownames(anno.col) <- colnames(marker.counts)
head(anno.col)
## Cell_Type
## 4th-C63_S30 Cell2
## 4th-C66_S36 Cell2
## 4th-C18_S31 Cell2
## 4th-C57_S18 Cell1
## 4th-C56_S17 Cell2
## 4th-C68_S41 Cell2
pheatmap(log(marker.counts+1), show_colnames =FALSE, clustering_method = "ward.D2",cutree_cols = 2,annotation_col = anno.col)
Here, we apply tSNE on the marker genes for SV2 obtained from IA-SVA
set.seed(345233)
tsne.res <- Rtsne(unique(t(log(marker.counts+1))), dims = 2)
plot(tsne.res$Y, main="tSNE post IA-SVA", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], oma=c(4,4,6,12))
legend("bottomright", levels(Cell_Type), fill=color.vec, bty="n")
tSNE using SV2 marker genes better seperate these ourlier cells. This analyses suggest that gene selection using IA-SVA combined with tSNE analyses can be a powerful way to detect rare cells introducing variability in the single cell gene expression data.
cor(Geo_Lib_Size, iasva.sv[,2])
## [1] -0.1776038
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 DescTools_0.99.19 corrplot_0.77
## [4] pheatmap_1.0.8 Rtsne_0.11 sva_3.22.0
## [7] genefilter_1.56.0 mgcv_1.8-16 nlme_3.1-128
## [10] iasvaExamples_0.1.0 iasva_0.95 irlba_2.2.1
## [13] Matrix_1.2-7.1 knitr_1.14 devtools_1.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 plyr_1.8.4 formatR_1.4
## [4] bitops_1.0-6 tools_3.3.2 boot_1.3-18
## [7] digest_0.6.12 manipulate_1.0.1 gtable_0.2.0
## [10] annotate_1.52.1 evaluate_0.10 memoise_1.0.0
## [13] RSQLite_1.1-2 lattice_0.20-34 DBI_0.6-1
## [16] yaml_2.1.14 parallel_3.3.2 expm_0.999-2
## [19] mvtnorm_1.0-6 withr_1.0.2 stringr_1.2.0
## [22] S4Vectors_0.12.2 IRanges_2.8.2 stats4_3.3.2
## [25] rprojroot_1.2 grid_3.3.2 Biobase_2.34.0
## [28] AnnotationDbi_1.36.2 XML_3.98-1.6 survival_2.40-1
## [31] foreign_0.8-67 rmarkdown_1.3 magrittr_1.5
## [34] MASS_7.3-45 scales_0.4.1 backports_1.0.4
## [37] htmltools_0.3.5 BiocGenerics_0.20.0 splines_3.3.2
## [40] colorspace_1.2-7 xtable_1.8-2 stringi_1.1.3
## [43] munsell_0.4.3 RCurl_1.95-4.8