IA-SVA based feature selection improves the performance of clustering algorithms [2]

Donghyung Lee

2017-08-15

Here, we used single cell RNA sequencing (scRNA-Seq) data with strong confounding variables, which is also obtained from human pancreatic islet samples (Xin et. al., 2016). This dataset is included in an R data package (“iasvaExamples”) containing data examples for IA-SVA (https://github.com/dleelab/iasvaExamples). To install the ‘iasvaExamples’ package, follow the instructions provided in the GitHub page.

Load packages

rm(list=ls())
library(iasva)
library(iasvaExamples)
library(sva)
library(irlba)
library(Rtsne)
library(pheatmap)
library(corrplot)
library(DescTools) #pcc i.e., Pearson's contingency coefficient
library(RColorBrewer)

color.vec <- brewer.pal(8, "Set1")

#color.pal from https://www.r-bloggers.com/the-paul-tol-21-color-salute/
tol21rainbow= c("#771155", "#AA4488", "#CC99BB", "#114477", "#4477AA", "#77AADD", "#117777", "#44AAAA", "#77CCCC", "#117744", "#44AA77", "#88CCAA", "#777711", "#AAAA44", "#DDDD77", "#774411", "#AA7744", "#DDAA77", "#771122", "#AA4455", "#DD7788")

Load the islet scRNA-Seq data (638 genes, >26K genes)

data("Xin_Islet_scRNAseq_Read_Counts")
data("Xin_Islet_scRNAseq_Annotations")
ls()
## [1] "color.vec"                      "tol21rainbow"                  
## [3] "Xin_Islet_scRNAseq_Annotations" "Xin_Islet_scRNAseq_Read_Counts"
counts <- Xin_Islet_scRNAseq_Read_Counts
anns <- Xin_Islet_scRNAseq_Annotations
dim(anns)
## [1] 1600    9
dim(counts)
## [1] 26616  1600
Lib_Size <- colSums(counts)
plot(sort(Lib_Size))

hist(Lib_Size)

summary(Lib_Size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  200000  775400 1003000 1106000 1290000 7377000

The annotations describing the islet samples and experimental settings are stored in “anns” and the read counts information is stored in “counts”.

Filter out low-expressed genes dataset reduced to 18K genes and 617 cells

##counts <- counts[, (anns$Cell_Type!="none")] 
##anns <- subset(anns, (Cell_Type!="none"))
dim(counts)

[1] 26616 1600

dim(anns)

[1] 1600 9

anns <- droplevels(anns)

prop.zeros <- sum(counts==0)/length(counts)
prop.zeros

[1] 0.7042622

filter = apply(counts, 1, function(x) length(x[x>5])>=3)

counts = counts[filter,]
dim(counts)

[1] 19226 1600

prop.zeros <- sum(counts==0)/length(counts)
prop.zeros

[1] 0.5916613

summary(anns)
      X             Age        Cell_Type          Condition  

SRR3541303: 1 Min. :23.00 alpha:946 non-diabetic:651
SRR3541304: 1 1st Qu.:37.00 beta :503 T2D :949
SRR3541305: 1 Median :42.00 delta: 58
SRR3541306: 1 Mean :43.89 PP : 93
SRR3541307: 1 3rd Qu.:55.00
SRR3541308: 1 Max. :68.00
(Other) :1594
Donor_ID Ethnicity Gender Num_Expressed_Genes T2D 4 :306 AA:369 female:658 Min. :1745
T2D 6 :224 AI: 45 male :942 1st Qu.:4412
T2D 5 :176 C :593 Median :5158
Non T2D 12:144 H :593 Mean :5185
T2D 1 : 91 3rd Qu.:5902
T2D 2 : 83 Max. :9731
(Other) :576
Mitochondrial_Fraction Min. : 0.2812
1st Qu.: 4.1449
Median : 6.3422
Mean : 7.4164
3rd Qu.:10.1241
Max. :22.8479

Patient_ID <- anns$Donor_ID
Gender <- anns$Gender
Age <- anns$Age
Cell_Type <- anns$Cell_Type
Phenotype <- anns$Condition
Ethnicity <- anns$Ethnicity
Mito_Frac <- anns$Mitochondrial_Fraction

ContCoef(table(Cell_Type, Patient_ID))

[1] 0.4832611

ContCoef(table(Cell_Type, Gender))

[1] 0.1033314

ContCoef(table(Cell_Type, Age))

[1] 0.4782088

ContCoef(table(Cell_Type, Phenotype))

[1] 0.03317408

ContCoef(table(Cell_Type, Ethnicity))

[1] 0.2517959

Note that the orignial cell assignments are highly correlated with known factors (e.g., Patient_ID (pcc=0.48)).

Calculate geometric library size, i.e., library size of log-transfromed read counts

It is well known that the geometric library size (i.e., library size of log-transfromed read counts) or proportion of expressed genes in each cell explains a very large portion of variability of scRNA-Seq data (Hicks et. al. 2015 BioRxiv, McDavid et. al. 2016 Nature Biotechnology). Frequently, the first principal component of log-transformed scRNA-Seq read counts is highly correlated with the geometric library size (r > 0.9). Here, we calculate the geometric library size to be used as a known factor in the IA-SVA analyses.

Geo_Lib_Size <- colSums(log(counts+1))
barplot(Geo_Lib_Size, xlab="Cell", las =2)

lcounts <- log(counts + 1)
pca.res = irlba(lcounts - rowMeans(lcounts), 5)$v
cor(Geo_Lib_Size, pca.res[,1])
## [1] 0.9971137
dim(lcounts)
## [1] 19226  1600

Run tSNE to cluster islet cells.

For comparison purposes,we applied tSNE on read counts of all genes. We used the Rtsne R package with default settings. Genes are color coded wrt their expression of marker genes.

set.seed(34544532)
tsne.res.all <- Rtsne(t(lcounts), dims = 2)

plot(tsne.res.all$Y, main="tSNE", xlab="tSNE Dim1", ylab="tSNE Dim2",pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], oma=c(4,4,6,12))
legend("topright", levels(Cell_Type), border="white", fill=color.vec, bty="n")

par(mfrow=c(2,2))
plot(tsne.res.all$Y, main="Gender", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Gender], bg=color.vec[Gender], oma=c(4,4,6,12))
legend("topright", levels(Gender), border="white", fill=color.vec, bty="n")
plot(tsne.res.all$Y, main="Patient ID", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=tol21rainbow[Patient_ID], bg=tol21rainbow[Patient_ID], oma=c(4,4,6,12))
legend("topright", levels(Patient_ID), border="white", fill=tol21rainbow, bty="n", cex=0.5)
plot(tsne.res.all$Y, main="Ethnicity", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Ethnicity], bg=color.vec[Ethnicity], oma=c(4,4,6,12))
legend("topright", levels(Ethnicity), border="white", fill=color.vec, bty="n")
plot(tsne.res.all$Y, main="Phenotype", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Phenotype], bg=color.vec[Phenotype], oma=c(4,4,6,12))
legend("topright", levels(Phenotype), border="white", fill=color.vec, bty="n")

par(mfrow=c(1,1))

Known factors deteriorate the performance of t-SNE.

Run PCA to cluster islet cells.

Here, we applied PCA on read counts of all genes. Genes are color coded wrt their expression of marker genes.

pairs(pca.res[,1:4], main="PCA", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], cex=0.8, oma=c(4,4,6,12))
legend("right", levels(Cell_Type), border="white", fill=color.vec, bty="n")

Run surrogate variable analysis (SVA).

Here, for comparison purposes we conduct SVA on our example data while adjusting for Patient_ID and Geo_Lib_Size and obtained 4 hidden factors.

mod1 <- model.matrix(~Patient_ID+Geo_Lib_Size)
mod0 <- cbind(mod1[,1])
sva.res = svaseq(counts,mod1,mod0, n.sv=4)$sv
## Number of significant surrogate variables is:  4 
## Iteration (out of 5 ):1  2  3  4  5
pairs(sva.res[,1:4], main="SVA", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], cex=0.8, oma=c(4,4,6,12))
legend("right", levels(Cell_Type), border="white", fill=color.vec, bty="n")

Run IA-SVA

Here, we run IA-SVA using Patient_ID and Geo_Lib_Size as known factors and identify 4 hidden factors.

mod <- model.matrix(~Patient_ID+Geo_Lib_Size)
iasva.res<- iasva(t(counts), mod[,-1],verbose=FALSE, permute=FALSE, num.sv=4)
## IA-SVA running...
## SV1 Detected!
## SV2 Detected!
## SV3 Detected!
## SV4 Detected!
## # of significant surrogate variables: 4
iasva.sv <- iasva.res$sv

## no color
pairs(iasva.sv[,1:4], pch=21, col="black", bg="black", cex=0.8)

## with color-coding
pairs(iasva.sv[,1:4], main="IA-SVA", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], cex=0.8, oma=c(4,4,6,12))
legend("right", levels(Cell_Type), border="white", fill=color.vec, bty="n")

As shown in the above figure (with no color), SV1, SV3 and SV4 separate cells. Since SV3 captures cell contamination, we consider S1 and SV4 in the subsequent analysis.

Correlation between SVs

cor(iasva.sv)
##             SV1         SV2         SV3         SV4
## SV1  1.00000000 -0.05318739 -0.04385273 -0.85917569
## SV2 -0.05318739  1.00000000 -0.14400770 -0.03225519
## SV3 -0.04385273 -0.14400770  1.00000000  0.08936163
## SV4 -0.85917569 -0.03225519  0.08936163  1.00000000
corrplot(cor(iasva.sv))

Find marker genes for SV1 and SV4.

Here, we find marker genes significantly associated with SV1 and SV4 (multiple testing adjusted p-value < 0.05, default significance cutoff) and having a high R-squared value (R-squared > 0.3) using the find.markers() function.

marker.counts <- find.markers(t(counts), as.matrix(iasva.sv[,c(1,4)]),  rsq.cutoff = 0.3)
## # of markers (SV1): 55
## # of markers (SV4): 39
## total # of unique markers:  57
nrow(marker.counts)
## [1] 57
anno.col <- data.frame(Cell_Type=Cell_Type)
rownames(anno.col) <- colnames(marker.counts)
head(anno.col)
##            Cell_Type
## SRR3541303      beta
## SRR3541304      beta
## SRR3541305      beta
## SRR3541306      beta
## SRR3541307      beta
## SRR3541308      beta
cell.type.col <- color.vec[1:4]
names(cell.type.col) <- c("alpha","beta","delta","PP")
anno.colors <- list(Cell_Type=cell.type.col)

pheatmap(log(marker.counts+1), show_colnames =FALSE, show_rownames = TRUE, clustering_method = "ward.D2",cutree_cols = 5,annotation_col = anno.col, annotation_colors=anno.colors)

Run tSNE post IA-SVA, i.e., run tSNE on 57 marker genes obtained from IA-SVA.

Here, we apply tSNE on the marker genes for SV1 and SV4 obtained from IA-SVA

set.seed(75458456)
tsne.res <- Rtsne(unique(t(log(marker.counts+1))), dims = 2)

plot(tsne.res$Y, main="tSNE post IA-SVA", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Cell_Type], bg=color.vec[Cell_Type], oma=c(4,4,6,12))
legend("bottomright", levels(Cell_Type), border="white", fill=color.vec, bty="n")

tSNE performed on marker genes selected via IA-SVA performs better than original tSNE analyses using all genes. This example again reiterates the importance of gene selection using IA-SVA for effective clustering of single-cell datasets.

Session Info

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2  DescTools_0.99.19   corrplot_0.77      
##  [4] pheatmap_1.0.8      Rtsne_0.11          sva_3.22.0         
##  [7] genefilter_1.56.0   mgcv_1.8-16         nlme_3.1-128       
## [10] iasvaExamples_0.1.0 iasva_0.95          irlba_2.2.1        
## [13] Matrix_1.2-7.1      knitr_1.14          devtools_1.12.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10         plyr_1.8.4           formatR_1.4         
##  [4] bitops_1.0-6         tools_3.3.2          boot_1.3-18         
##  [7] digest_0.6.12        manipulate_1.0.1     gtable_0.2.0        
## [10] annotate_1.52.1      evaluate_0.10        memoise_1.0.0       
## [13] RSQLite_1.1-2        lattice_0.20-34      DBI_0.6-1           
## [16] yaml_2.1.14          parallel_3.3.2       expm_0.999-2        
## [19] mvtnorm_1.0-6        withr_1.0.2          stringr_1.2.0       
## [22] S4Vectors_0.12.2     IRanges_2.8.2        stats4_3.3.2        
## [25] rprojroot_1.2        grid_3.3.2           Biobase_2.34.0      
## [28] AnnotationDbi_1.36.2 XML_3.98-1.6         survival_2.40-1     
## [31] foreign_0.8-67       rmarkdown_1.3        magrittr_1.5        
## [34] MASS_7.3-45          scales_0.4.1         backports_1.0.4     
## [37] htmltools_0.3.5      BiocGenerics_0.20.0  splines_3.3.2       
## [40] colorspace_1.2-7     xtable_1.8-2         stringi_1.1.3       
## [43] munsell_0.4.3        RCurl_1.95-4.8