Here, we used single cell RNA sequencing (scRNA-Seq) data with strong confounding variables, which is also obtained from human pancreatic islet samples (Xin et. al., 2016). This dataset is included in an R data package (“iasvaExamples”) containing data examples for IA-SVA (https://github.com/dleelab/iasvaExamples). To install the ‘iasvaExamples’ package, follow the instructions provided in the GitHub page.
rm(list=ls())
library(iasva)
library(iasvaExamples)
library(sva)
library(irlba)
library(Rtsne)
library(pheatmap)
library(corrplot)
library(DescTools) #pcc i.e., Pearson's contingency coefficient
library(RColorBrewer)
color.vec <- brewer.pal(8, "Set1")
#color.pal from https://www.r-bloggers.com/the-paul-tol-21-color-salute/
tol21rainbow= c("#771155", "#AA4488", "#CC99BB", "#114477", "#4477AA", "#77AADD", "#117777", "#44AAAA", "#77CCCC", "#117744", "#44AA77", "#88CCAA", "#777711", "#AAAA44", "#DDDD77", "#774411", "#AA7744", "#DDAA77", "#771122", "#AA4455", "#DD7788")
data("Xin_Islet_scRNAseq_Read_Counts")
data("Xin_Islet_scRNAseq_Annotations")
ls()
## [1] "color.vec" "tol21rainbow"
## [3] "Xin_Islet_scRNAseq_Annotations" "Xin_Islet_scRNAseq_Read_Counts"
counts <- Xin_Islet_scRNAseq_Read_Counts
anns <- Xin_Islet_scRNAseq_Annotations
dim(anns)
## [1] 1600 9
dim(counts)
## [1] 26616 1600
Lib_Size <- colSums(counts)
plot(sort(Lib_Size))
hist(Lib_Size)
summary(Lib_Size)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 200000 775400 1003000 1106000 1290000 7377000
The annotations describing the islet samples and experimental settings are stored in “anns” and the read counts information is stored in “counts”.
##counts <- counts[, (anns$Cell_Type!="none")]
##anns <- subset(anns, (Cell_Type!="none"))
dim(counts)
[1] 26616 1600
dim(anns)
[1] 1600 9
anns <- droplevels(anns)
prop.zeros <- sum(counts==0)/length(counts)
prop.zeros
[1] 0.7042622
filter = apply(counts, 1, function(x) length(x[x>5])>=3)
counts = counts[filter,]
dim(counts)
[1] 19226 1600
prop.zeros <- sum(counts==0)/length(counts)
prop.zeros
[1] 0.5916613
summary(anns)
X Age Cell_Type Condition
SRR3541303: 1 Min. :23.00 alpha:946 non-diabetic:651
SRR3541304: 1 1st Qu.:37.00 beta :503 T2D :949
SRR3541305: 1 Median :42.00 delta: 58
SRR3541306: 1 Mean :43.89 PP : 93
SRR3541307: 1 3rd Qu.:55.00
SRR3541308: 1 Max. :68.00
(Other) :1594
Donor_ID Ethnicity Gender Num_Expressed_Genes T2D 4 :306 AA:369 female:658 Min. :1745
T2D 6 :224 AI: 45 male :942 1st Qu.:4412
T2D 5 :176 C :593 Median :5158
Non T2D 12:144 H :593 Mean :5185
T2D 1 : 91 3rd Qu.:5902
T2D 2 : 83 Max. :9731
(Other) :576
Mitochondrial_Fraction Min. : 0.2812
1st Qu.: 4.1449
Median : 6.3422
Mean : 7.4164
3rd Qu.:10.1241
Max. :22.8479
Patient_ID <- anns$Donor_ID
Gender <- anns$Gender
Age <- anns$Age
Cell_Type <- anns$Cell_Type
Phenotype <- anns$Condition
Ethnicity <- anns$Ethnicity
Mito_Frac <- anns$Mitochondrial_Fraction
ContCoef(table(Cell_Type, Patient_ID))
[1] 0.4832611
ContCoef(table(Cell_Type, Gender))
[1] 0.1033314
ContCoef(table(Cell_Type, Age))
[1] 0.4782088
ContCoef(table(Cell_Type, Phenotype))
[1] 0.03317408
ContCoef(table(Cell_Type, Ethnicity))
[1] 0.2517959
Note that the orignial cell assignments are highly correlated with known factors (e.g., Patient_ID (pcc=0.48)).
It is well known that the geometric library size (i.e., library size of log-transfromed read counts) or proportion of expressed genes in each cell explains a very large portion of variability of scRNA-Seq data (Hicks et. al. 2015 BioRxiv, McDavid et. al. 2016 Nature Biotechnology). Frequently, the first principal component of log-transformed scRNA-Seq read counts is highly correlated with the geometric library size (r > 0.9). Here, we calculate the geometric library size to be used as a known factor in the IA-SVA analyses.
Geo_Lib_Size <- colSums(log(counts+1))
barplot(Geo_Lib_Size, xlab="Cell", las =2)
lcounts <- log(counts + 1)
pca.res = irlba(lcounts - rowMeans(lcounts), 5)$v
cor(Geo_Lib_Size, pca.res[,1])
## [1] 0.9971137
dim(lcounts)
## [1] 19226 1600
For comparison purposes,we applied tSNE on read counts of all genes. We used the Rtsne R package with default settings. Genes are color coded wrt their expression of marker genes.
set.seed(34544532)
tsne.res.all <- Rtsne(t(lcounts), dims = 2)
plot(tsne.res.all$Y, main="tSNE", xlab="tSNE Dim1", ylab="tSNE Dim2",pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], oma=c(4,4,6,12))
legend("topright", levels(Cell_Type), border="white", fill=color.vec, bty="n")
par(mfrow=c(2,2))
plot(tsne.res.all$Y, main="Gender", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Gender], bg=color.vec[Gender], oma=c(4,4,6,12))
legend("topright", levels(Gender), border="white", fill=color.vec, bty="n")
plot(tsne.res.all$Y, main="Patient ID", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=tol21rainbow[Patient_ID], bg=tol21rainbow[Patient_ID], oma=c(4,4,6,12))
legend("topright", levels(Patient_ID), border="white", fill=tol21rainbow, bty="n", cex=0.5)
plot(tsne.res.all$Y, main="Ethnicity", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Ethnicity], bg=color.vec[Ethnicity], oma=c(4,4,6,12))
legend("topright", levels(Ethnicity), border="white", fill=color.vec, bty="n")
plot(tsne.res.all$Y, main="Phenotype", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Phenotype], bg=color.vec[Phenotype], oma=c(4,4,6,12))
legend("topright", levels(Phenotype), border="white", fill=color.vec, bty="n")
par(mfrow=c(1,1))
Known factors deteriorate the performance of t-SNE.
Here, we applied PCA on read counts of all genes. Genes are color coded wrt their expression of marker genes.
pairs(pca.res[,1:4], main="PCA", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], cex=0.8, oma=c(4,4,6,12))
legend("right", levels(Cell_Type), border="white", fill=color.vec, bty="n")
Here, for comparison purposes we conduct SVA on our example data while adjusting for Patient_ID and Geo_Lib_Size and obtained 4 hidden factors.
mod1 <- model.matrix(~Patient_ID+Geo_Lib_Size)
mod0 <- cbind(mod1[,1])
sva.res = svaseq(counts,mod1,mod0, n.sv=4)$sv
## Number of significant surrogate variables is: 4
## Iteration (out of 5 ):1 2 3 4 5
pairs(sva.res[,1:4], main="SVA", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], cex=0.8, oma=c(4,4,6,12))
legend("right", levels(Cell_Type), border="white", fill=color.vec, bty="n")
Here, we run IA-SVA using Patient_ID and Geo_Lib_Size as known factors and identify 4 hidden factors.
mod <- model.matrix(~Patient_ID+Geo_Lib_Size)
iasva.res<- iasva(t(counts), mod[,-1],verbose=FALSE, permute=FALSE, num.sv=4)
## IA-SVA running...
## SV1 Detected!
## SV2 Detected!
## SV3 Detected!
## SV4 Detected!
## # of significant surrogate variables: 4
iasva.sv <- iasva.res$sv
## no color
pairs(iasva.sv[,1:4], pch=21, col="black", bg="black", cex=0.8)
## with color-coding
pairs(iasva.sv[,1:4], main="IA-SVA", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], cex=0.8, oma=c(4,4,6,12))
legend("right", levels(Cell_Type), border="white", fill=color.vec, bty="n")
As shown in the above figure (with no color), SV1, SV3 and SV4 separate cells. Since SV3 captures cell contamination, we consider S1 and SV4 in the subsequent analysis.
cor(iasva.sv)
## SV1 SV2 SV3 SV4
## SV1 1.00000000 -0.05318739 -0.04385273 -0.85917569
## SV2 -0.05318739 1.00000000 -0.14400770 -0.03225519
## SV3 -0.04385273 -0.14400770 1.00000000 0.08936163
## SV4 -0.85917569 -0.03225519 0.08936163 1.00000000
corrplot(cor(iasva.sv))
Here, we find marker genes significantly associated with SV1 and SV4 (multiple testing adjusted p-value < 0.05, default significance cutoff) and having a high R-squared value (R-squared > 0.3) using the find.markers() function.
marker.counts <- find.markers(t(counts), as.matrix(iasva.sv[,c(1,4)]), rsq.cutoff = 0.3)
## # of markers (SV1): 55
## # of markers (SV4): 39
## total # of unique markers: 57
nrow(marker.counts)
## [1] 57
anno.col <- data.frame(Cell_Type=Cell_Type)
rownames(anno.col) <- colnames(marker.counts)
head(anno.col)
## Cell_Type
## SRR3541303 beta
## SRR3541304 beta
## SRR3541305 beta
## SRR3541306 beta
## SRR3541307 beta
## SRR3541308 beta
cell.type.col <- color.vec[1:4]
names(cell.type.col) <- c("alpha","beta","delta","PP")
anno.colors <- list(Cell_Type=cell.type.col)
pheatmap(log(marker.counts+1), show_colnames =FALSE, show_rownames = TRUE, clustering_method = "ward.D2",cutree_cols = 5,annotation_col = anno.col, annotation_colors=anno.colors)
Here, we apply tSNE on the marker genes for SV1 and SV4 obtained from IA-SVA
set.seed(75458456)
tsne.res <- Rtsne(unique(t(log(marker.counts+1))), dims = 2)
plot(tsne.res$Y, main="tSNE post IA-SVA", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], oma=c(4,4,6,12))
legend("bottomright", levels(Cell_Type), border="white", fill=color.vec, bty="n")
tSNE performed on marker genes selected via IA-SVA performs better than original tSNE analyses using all genes. This example again reiterates the importance of gene selection using IA-SVA for effective clustering of single-cell datasets.
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 DescTools_0.99.19 corrplot_0.77
## [4] pheatmap_1.0.8 Rtsne_0.11 sva_3.22.0
## [7] genefilter_1.56.0 mgcv_1.8-16 nlme_3.1-128
## [10] iasvaExamples_0.1.0 iasva_0.95 irlba_2.2.1
## [13] Matrix_1.2-7.1 knitr_1.14 devtools_1.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 plyr_1.8.4 formatR_1.4
## [4] bitops_1.0-6 tools_3.3.2 boot_1.3-18
## [7] digest_0.6.12 manipulate_1.0.1 gtable_0.2.0
## [10] annotate_1.52.1 evaluate_0.10 memoise_1.0.0
## [13] RSQLite_1.1-2 lattice_0.20-34 DBI_0.6-1
## [16] yaml_2.1.14 parallel_3.3.2 expm_0.999-2
## [19] mvtnorm_1.0-6 withr_1.0.2 stringr_1.2.0
## [22] S4Vectors_0.12.2 IRanges_2.8.2 stats4_3.3.2
## [25] rprojroot_1.2 grid_3.3.2 Biobase_2.34.0
## [28] AnnotationDbi_1.36.2 XML_3.98-1.6 survival_2.40-1
## [31] foreign_0.8-67 rmarkdown_1.3 magrittr_1.5
## [34] MASS_7.3-45 scales_0.4.1 backports_1.0.4
## [37] htmltools_0.3.5 BiocGenerics_0.20.0 splines_3.3.2
## [40] colorspace_1.2-7 xtable_1.8-2 stringi_1.1.3
## [43] munsell_0.4.3 RCurl_1.95-4.8