There are some publications recently to highlight the taxonomic classification using Greengenes database is better than BLAST for 16S data set. I make a toy example here using ComMA, to show how these two classifier are performed for a particular dataset.
The content of this tutorial includes how to interpret the community matrix of 16S (also known as OTU table) with taxonomic classifications, and create bar charts.
library(gg1L)
library(ComMA)
This 16S data set is from Miyake et al 2015, which was sequenced on a Roche 454 GS FLX Titanium and processed before OTU clustering. Their QC steps performed were as follows: 1. use MOTHUR to denoise (320min, 800max flows); 2. trim barcodes and primers; 3. remove homopolymers of more than 8bp; 4. remove sequences < 200bp lengths; 5. remove chimeric sequences using UCHIME; 6. taxonomic calling against Greengenes May 2013 release; 7. subsequent removal of chloroplast, mictochondrial, and non-bacterial sequences.
UPARSE pipeline is used to create OTUs. The version of USEARCH used is 8.0.1623, and the threshold is set to 97%. The community matrix (OTU table) miyake.454.16s.otus.csv in the result is available.
#getwd()
cm <- readCommunityMatrix("../data-raw/miyake.454.16s.otus.csv", "16S", minAbund=1)
##
## Upload 16S community matrix : 59 samples excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.otus.csv
## Remove 0 rows whose minimum abundance < 1 !
cm[1:3,]
## Agah01 Agah02 Agah03 Agah04 Anig01 Anig02 Anig03
## Agah01_H0COW3M02H2QER 379 0 6 16 18 0 1
## Agah01_H2ZDNKJ02G8AL6 961 0 60 24 9 4 1
## Agah01_H0COW3M02HMT2E 448 0 0 2 0 1 0
## Anig04 Asoh01 Asoh02 Asoh03 Asoh04 Asoh05 Asoh06
## Agah01_H0COW3M02H2QER 2 110 18 2 197 12 168
## Agah01_H2ZDNKJ02G8AL6 3 262 3 26 308 95 8
## Agah01_H0COW3M02HMT2E 0 0 0 0 0 0 0
## Asoh07 Asoh08 Csor01 Csor02 Csor03 Cstr01 Cstr02
## Agah01_H0COW3M02H2QER 23 0 3 0 0 7 6
## Agah01_H2ZDNKJ02G8AL6 4 0 12 0 0 18 1
## Agah01_H0COW3M02HMT2E 0 0 4542 0 0 2 1
## Cstr03 Nele01 Nele02 Nele03 Nele04 Nele05 Nele06
## Agah01_H0COW3M02H2QER 0 0 0 0 0 0 0
## Agah01_H2ZDNKJ02G8AL6 0 3294 5305 2679 1773 2453 3198
## Agah01_H0COW3M02HMT2E 0 0 0 0 0 0 0
## Nele07 Nele08 Nele09 Nele10 Nele11 Nhex01 Nhex02
## Agah01_H0COW3M02H2QER 0 0 0 0 0 0 0
## Agah01_H2ZDNKJ02G8AL6 3856 3102 2991 4455 5241 0 0
## Agah01_H0COW3M02HMT2E 0 0 0 0 0 0 0
## Nhex03 Nhex04 Nuni01 Nuni02 Nuni03 Snig01 Snig02
## Agah01_H0COW3M02H2QER 2 3 0 2 0 6 2
## Agah01_H2ZDNKJ02G8AL6 1 0 1478 1418 1204 2 4
## Agah01_H0COW3M02HMT2E 1 0 0 0 0 0 0
## Snig03 Sste01 Sste02 Sste03 Zdes01 Zdes02 Zdes03
## Agah01_H0COW3M02H2QER 0 0 0 0 2313 1706 1771
## Agah01_H2ZDNKJ02G8AL6 0 1 0 2 20 413 302
## Agah01_H0COW3M02HMT2E 25 0 0 0 0 0 0
## Zdes04 Zdes05 Zxan01 Zxan02 Zxan03 Zxan04 Zxan05
## Agah01_H0COW3M02H2QER 395 371 437 2540 2619 221 361
## Agah01_H2ZDNKJ02G8AL6 0 1275 19 627 16 28 29
## Agah01_H0COW3M02HMT2E 0 0 0 0 0 1 0
## Zxan06 Zxan07 Zxan08
## Agah01_H0COW3M02H2QER 650 76 1
## Agah01_H2ZDNKJ02G8AL6 1433 20 2
## Agah01_H0COW3M02HMT2E 0 0 0
summaryCM(cm)
## total
## reads 202,430
## OTUs 2,682
## samples 59
## singletons 939
## doubletons 353
## max.OTU.abundance 48,440
## min.OTU.abundance 1
## max.sample.abundance 10,612
## min.sample.abundance 1,054
The taxonomy of OTUs is first classified by the RDP classifier using the latest version of Greengenes database (May 2013). And then, the second taxonomy of same OTUs is classified by BLAST and interpreted by MEGAN.
The table of taxonomy from kingdom to genus is uploaded as below, where MEGAN repeats the last classified taxonomy at the higher rank through the lineage for all unclassified taxonomy, but RDP keeps them in blank.
Note: Uncultured Bacteria (taxid:77133) is excluded during BLAST in order to give more accurate taxonomic classification.
The classification without excluding Uncultured Bacteria provides much worse result for this 16S dataset.
#BLAST
tt.megan <- readTaxaTable("../data-raw/miyake.454.16s.megan.txt", "16S megan taxa table", taxa.group="all")
##
## Upload 16S megan taxa table taxonomy table : 7 columns excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.megan.txt
# 1st column is taxonomic lineage path
tt.megan[1:3,-1]
## kingdom phylum class
## Agah01_H0COW3M02F0MYH Bacteria Firmicutes Firmicutes
## Agah01_H0COW3M02F0SWP root root root
## Agah01_H0COW3M02F1ZMK Bacteria Proteobacteria Gammaproteobacteria
## order family
## Agah01_H0COW3M02F0MYH Firmicutes Firmicutes
## Agah01_H0COW3M02F0SWP root root
## Agah01_H0COW3M02F1ZMK Gammaproteobacteria Gammaproteobacteria
## genus
## Agah01_H0COW3M02F0MYH Firmicutes
## Agah01_H0COW3M02F0SWP root
## Agah01_H0COW3M02F1ZMK Gammaproteobacteria
# RDP
tt.rdp <- readTaxaTable("../data-raw/miyake.454.16s.greengenes.txt", "16S rdp taxa table", taxa.group="all")
##
## Upload 16S rdp taxa table taxonomy table : 9 columns excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.greengenes.txt
# 1st column is taxonomic lineage path
tt.rdp[1:3,-1]
## confidence kingdom phylum
## Agah01_H0COW3M02F0MYH 0.53 Bacteria Firmicutes
## Agah01_H0COW3M02F0SWP 0.79 Bacteria TM7
## Agah01_H0COW3M02F1ZMK 0.91 Bacteria Proteobacteria
## class order family
## Agah01_H0COW3M02F0MYH Clostridia Clostridiales Lachnospiraceae
## Agah01_H0COW3M02F0SWP
## Agah01_H0COW3M02F1ZMK Gammaproteobacteria Legionellales
## genus species
## Agah01_H0COW3M02F0MYH Coprococcus
## Agah01_H0COW3M02F0SWP
## Agah01_H0COW3M02F1ZMK
The number of reads of OTUs assigned to different taxonomies is summarised below regarding to classifiers. Because this dataset is preprossed, RDP using 0.5 confidence threshold do not have any unclassified taxonomy.
# return a list of taxa assignments, names = c("megan", "rdp.5", "rdp.8")
getTaxaAssignList <- function(cm, tt.megan, tt.rdp, taxa.group="all", rank="kingdom") {
cm.taxa <- mergeCMTaxa(cm, tt.megan)
ta.megan <- assignTaxaByRank(cm.taxa)
sum(ta.megan[[rank]])
cm.taxa <- mergeCMTaxa(cm, tt.rdp, classifier="RDP")
ta.rdp.8 <- assignTaxaByRank(cm.taxa)
sum(ta.rdp.8[[rank]])
cm.taxa <- mergeCMTaxa(cm, tt.rdp, classifier="RDP", min.conf=0.5)
ta.rdp.5 <- assignTaxaByRank(cm.taxa)
sum(ta.rdp.5[[rank]])
list(megan=ta.megan, rdp.5=ta.rdp.5, rdp.8=ta.rdp.8)
}
getReads <- function(ta.list, rank="kingdom") {
reads.megan <- ta.list[["megan"]][[rank]]
reads.megan$classifier <- "MEGAN"
reads.megan[,rank] <- rownames(reads.megan)
reads.rdp.8 <- ta.list[["rdp.8"]][[rank]]
reads.rdp.8$classifier <- "RDP 0.8"
reads.rdp.8[,rank] <- rownames(reads.rdp.8)
reads.rdp.5 <- ta.list[["rdp.5"]][[rank]]
reads.rdp.5$classifier <- "RDP 0.5"
reads.rdp.5[,rank] <- rownames(reads.rdp.5)
reads <- rbind(reads.megan, reads.rdp.8, reads.rdp.5)
}
ta.list <- getTaxaAssignList(cm, tt.megan, tt.rdp)
## Parse classification from MEGAN classifier , and preprocess taxanomy names.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Merge 2682 rows in community matrix with 2682 rows in taxa table, get 2682 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Parse classification from RDP classifier , and preprocess taxanomy names.
## Set 914 rows as 'unclassified' from the total of 2682 in RDP taxa table, whose confidence < 0.8 .
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Merge 2682 rows in community matrix with 2682 rows in taxa table, get 2682 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Parse classification from RDP classifier , and preprocess taxanomy names.
## Set 0 rows as 'unclassified' from the total of 2682 in RDP taxa table, whose confidence < 0.5 .
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Merge 2682 rows in community matrix with 2682 rows in taxa table, get 2682 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
names(ta.list[["megan"]])
## [1] "kingdom" "phylum" "class" "order" "family"
ta.list[["megan"]][["kingdom"]]
## total
## Bacteria 194527
## Eukaryota 62
## unclassified 7841
ta.list[["rdp.8"]][["phylum"]]
## total
## Acidobacteria 87
## Actinobacteria 6761
## Bacteroidetes 3860
## Chlamydiae 8
## Chloroflexi 13
## Cyanobacteria 2376
## Deferribacteres 15
## Elusimicrobia 10
## Fibrobacteres 1
## Firmicutes 120806
## Fusobacteria 8619
## GN02 1
## Lentisphaerae 22
## Planctomycetes 1
## Proteobacteria 18101
## SBR1093 24
## Spirochaetes 441
## Synergistetes 71
## Tenericutes 250
## Thermi 2
## TM6 109
## TM7 52
## Verrucomicrobia 569
## WPS-2 17
## WS2 1
## WS3 1
## unclassified Bacteria 43
## unclassified 40169
ta <- merge(ta.list[["rdp.5"]][["phylum"]], ta.list[["rdp.8"]][["phylum"]],
by = "row.names", all = TRUE)
ta[is.na(ta)] <- 0
colnames(ta) <- c("phylum", "rdp.5", "rdp.8")
bar.per <- ggPercentageBarChart(ta, melt.id="phylum", verbose=F, title="Abundance Percentage Bar")
bar.per$gg.plot
grid_arrange_shared_legend
shares a legend between multiple plots, and returns a gtable object:
kingdom <- getReads(ta.list)
kingdom
## total classifier kingdom
## Bacteria 194527 MEGAN Bacteria
## Eukaryota 62 MEGAN Eukaryota
## unclassified 7841 MEGAN unclassified
## Bacteria1 162261 RDP 0.8 Bacteria
## unclassified1 40169 RDP 0.8 unclassified
## Bacteria2 202430 RDP 0.5 Bacteria
bar.kingdom <- ggBarChart(kingdom, x.id="kingdom", y.id="total",
fill.id="classifier", y.trans="log", y.lab="reads",
title="Classification Including Singletons", verbose=F)
bar.kingdom <- bar.kingdom + ggAddNumbers(label.id="total", text.size=3,
hjust=0.5, vjust=2)
bar.kingdom
# no singleton
cm.min2 <- readCommunityMatrix("../data-raw/miyake.454.16s.otus.csv", "16S")
##
## Upload 16S community matrix : 59 samples excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.otus.csv
## Remove 939 rows whose minimum abundance < 2 !
summaryCM(cm.min2)
## total
## reads 201,491
## OTUs 1,743
## samples 59
## singletons 0
## doubletons 353
## max.OTU.abundance 48,440
## min.OTU.abundance 2
## max.sample.abundance 10,520
## min.sample.abundance 1,054
# no singleton
ta.list <- getTaxaAssignList(cm.min2, tt.megan, tt.rdp)
## Parse classification from MEGAN classifier , and preprocess taxanomy names.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Merge 1743 rows in community matrix with 2682 rows in taxa table, get 1743 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Parse classification from RDP classifier , and preprocess taxanomy names.
## Set 914 rows as 'unclassified' from the total of 2682 in RDP taxa table, whose confidence < 0.8 .
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Merge 1743 rows in community matrix with 2682 rows in taxa table, get 1743 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Parse classification from RDP classifier , and preprocess taxanomy names.
## Set 0 rows as 'unclassified' from the total of 2682 in RDP taxa table, whose confidence < 0.5 .
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
## Merge 1743 rows in community matrix with 2682 rows in taxa table, get 1743 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family
kingdom.min2 <- getReads(ta.list)
kingdom.min2
## total classifier kingdom
## Bacteria 193839 MEGAN Bacteria
## Eukaryota 49 MEGAN Eukaryota
## unclassified 7603 MEGAN unclassified
## Bacteria1 161650 RDP 0.8 Bacteria
## unclassified1 39841 RDP 0.8 unclassified
## Bacteria2 201491 RDP 0.5 Bacteria
merge.kingdom <- merge(kingdom, kingdom.min2, by=c("kingdom", "classifier"))
# informal code to select Bacteria
bacteria <- merge.kingdom[merge.kingdom=="Bacteria",]
bacteria$percentage <- bacteria$total.y / bacteria$total.x
bacteria$singletons <- 1 - bacteria$percentage
bacteria
## kingdom classifier total.x total.y percentage singletons
## 1 Bacteria MEGAN 194527 193839 0.9964632 0.003536784
## 2 Bacteria RDP 0.5 202430 201491 0.9953614 0.004638641
## 3 Bacteria RDP 0.8 162261 161650 0.9962345 0.003765538
ggBarChart(bacteria, x.id="classifier", y.id="singletons", y.trans="per",
title="Percentage of Singletons Classified as Bacteria", verbose=F)
# select Bacteria
tt.megan <- readTaxaTable("../data-raw/miyake.454.16s.megan.txt", "16S megan taxa table", taxa.group="Bacteria")
##
## Upload 16S megan taxa table taxonomy table : 7 columns excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.megan.txt
## Select 2069 classifications, by given taxa.group = Bacteria , rank = kingdom , include = TRUE .
tt.rdp <- readTaxaTable("../data-raw/miyake.454.16s.greengenes.txt", "16S rdp taxa table", taxa.group="Bacteria")
##
## Upload 16S rdp taxa table taxonomy table : 9 columns excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.greengenes.txt
n.taxa.df <- data.frame(row.names=c("MEGAN", "RDP 0.5", "RDP 0.8"))
reads.df <- data.frame(row.names=c("MEGAN", "RDP 0.5", "RDP 0.8"))
r <- 1
for (ta.name in c(c("megan", "rdp.5", "rdp.8"))) {
print(paste("Classifier :", ta.name))
ta.df <- ta.list[[ta.name]]
# start from phylum
for (ra in names(ta.df)[-1]) {
ta <- ta.df[[ra]]
n.uncl <- length(grep("unclassified", rownames(ta), ignore.case = T))
print(paste(ra, ": total taxa =", nrow(ta), ", where",
n.uncl, "unclassified taxa."))
# not count "unclassified"
n.taxa.df[r, ra] <- nrow(ta) - n.uncl
ta <- subset(ta, !grepl("unclassified", rownames(ta), ignore.case = T))
reads.df[r, ra] <- sum(ta[,"total"])
}
r <- r + 1
}
## [1] "Classifier : megan"
## [1] "phylum : total taxa = 21 , where 3 unclassified taxa."
## [1] "class : total taxa = 53 , where 17 unclassified taxa."
## [1] "order : total taxa = 82 , where 28 unclassified taxa."
## [1] "family : total taxa = 97 , where 38 unclassified taxa."
## [1] "Classifier : rdp.5"
## [1] "phylum : total taxa = 22 , where 1 unclassified taxa."
## [1] "class : total taxa = 61 , where 14 unclassified taxa."
## [1] "order : total taxa = 102 , where 22 unclassified taxa."
## [1] "family : total taxa = 164 , where 39 unclassified taxa."
## [1] "Classifier : rdp.8"
## [1] "phylum : total taxa = 23 , where 2 unclassified taxa."
## [1] "class : total taxa = 59 , where 14 unclassified taxa."
## [1] "order : total taxa = 99 , where 22 unclassified taxa."
## [1] "family : total taxa = 146 , where 35 unclassified taxa."
# result
n.taxa.df
## phylum class order family
## MEGAN 18 36 54 59
## RDP 0.5 21 47 80 125
## RDP 0.8 21 45 77 111
n.taxa.df$id = rownames(n.taxa.df)
n.taxa.melt <- melt(n.taxa.df)
## Using id as id variables
ggBarChart(n.taxa.melt, x.id="variable", y.id="value", fill.id="id",
x.lab="rank", y.lab="classified taxonomy",
title="The Number Of Classified Taxonomy Under Bacteria (No Singletons)",
legend.title="classifier", verbose=F) +
ggAddNumbers(label.id="value", text.size=3, hjust=0.5, vjust=2)
# result
reads.df
## phylum class order family
## MEGAN 139899 117891 41771 22989
## RDP 0.5 201458 189928 188391 178640
## RDP 0.8 161617 154821 154180 146210
reads.df$id = rownames(reads.df)
reads.melt <- melt(reads.df)
## Using id as id variables
ggBarChart(reads.melt, x.id="variable", y.id="value", fill.id="id",
x.lab="rank", y.lab="reads of classified OTUs",
title="Reads Of Classified OTUs Under Bacteria (No Singletons)",
legend.title="classifier", verbose=F) +
ggAddNumbers(label.id="value", text.size=3, hjust=0.5, vjust=2)
RDP + Greengenes using 0.8 confidence threshold is recommanded to make the taxonomic classification for 16S dataset. But dropping confidence threshold with concerns could increase the abundance of classified taxonomy above genus.