Introduction

There are some publications recently to highlight the taxonomic classification using Greengenes database is better than BLAST for 16S data set. I make a toy example here using ComMA, to show how these two classifier are performed for a particular dataset.

The content of this tutorial includes how to interpret the community matrix of 16S (also known as OTU table) with taxonomic classifications, and create bar charts.

library(gg1L)
library(ComMA)

Data set

This 16S data set is from Miyake et al 2015, which was sequenced on a Roche 454 GS FLX Titanium and processed before OTU clustering. Their QC steps performed were as follows: 1. use MOTHUR to denoise (320min, 800max flows); 2. trim barcodes and primers; 3. remove homopolymers of more than 8bp; 4. remove sequences < 200bp lengths; 5. remove chimeric sequences using UCHIME; 6. taxonomic calling against Greengenes May 2013 release; 7. subsequent removal of chloroplast, mictochondrial, and non-bacterial sequences.

OTU clustering and community matrix

UPARSE pipeline is used to create OTUs. The version of USEARCH used is 8.0.1623, and the threshold is set to 97%. The community matrix (OTU table) miyake.454.16s.otus.csv in the result is available.

#getwd()
cm <- readCommunityMatrix("../data-raw/miyake.454.16s.otus.csv", "16S", minAbund=1)

## 
## Upload 16S community matrix : 59 samples excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.otus.csv 
## Remove 0 rows whose minimum abundance < 1 !

cm[1:3,]

##                       Agah01 Agah02 Agah03 Agah04 Anig01 Anig02 Anig03
## Agah01_H0COW3M02H2QER    379      0      6     16     18      0      1
## Agah01_H2ZDNKJ02G8AL6    961      0     60     24      9      4      1
## Agah01_H0COW3M02HMT2E    448      0      0      2      0      1      0
##                       Anig04 Asoh01 Asoh02 Asoh03 Asoh04 Asoh05 Asoh06
## Agah01_H0COW3M02H2QER      2    110     18      2    197     12    168
## Agah01_H2ZDNKJ02G8AL6      3    262      3     26    308     95      8
## Agah01_H0COW3M02HMT2E      0      0      0      0      0      0      0
##                       Asoh07 Asoh08 Csor01 Csor02 Csor03 Cstr01 Cstr02
## Agah01_H0COW3M02H2QER     23      0      3      0      0      7      6
## Agah01_H2ZDNKJ02G8AL6      4      0     12      0      0     18      1
## Agah01_H0COW3M02HMT2E      0      0   4542      0      0      2      1
##                       Cstr03 Nele01 Nele02 Nele03 Nele04 Nele05 Nele06
## Agah01_H0COW3M02H2QER      0      0      0      0      0      0      0
## Agah01_H2ZDNKJ02G8AL6      0   3294   5305   2679   1773   2453   3198
## Agah01_H0COW3M02HMT2E      0      0      0      0      0      0      0
##                       Nele07 Nele08 Nele09 Nele10 Nele11 Nhex01 Nhex02
## Agah01_H0COW3M02H2QER      0      0      0      0      0      0      0
## Agah01_H2ZDNKJ02G8AL6   3856   3102   2991   4455   5241      0      0
## Agah01_H0COW3M02HMT2E      0      0      0      0      0      0      0
##                       Nhex03 Nhex04 Nuni01 Nuni02 Nuni03 Snig01 Snig02
## Agah01_H0COW3M02H2QER      2      3      0      2      0      6      2
## Agah01_H2ZDNKJ02G8AL6      1      0   1478   1418   1204      2      4
## Agah01_H0COW3M02HMT2E      1      0      0      0      0      0      0
##                       Snig03 Sste01 Sste02 Sste03 Zdes01 Zdes02 Zdes03
## Agah01_H0COW3M02H2QER      0      0      0      0   2313   1706   1771
## Agah01_H2ZDNKJ02G8AL6      0      1      0      2     20    413    302
## Agah01_H0COW3M02HMT2E     25      0      0      0      0      0      0
##                       Zdes04 Zdes05 Zxan01 Zxan02 Zxan03 Zxan04 Zxan05
## Agah01_H0COW3M02H2QER    395    371    437   2540   2619    221    361
## Agah01_H2ZDNKJ02G8AL6      0   1275     19    627     16     28     29
## Agah01_H0COW3M02HMT2E      0      0      0      0      0      1      0
##                       Zxan06 Zxan07 Zxan08
## Agah01_H0COW3M02H2QER    650     76      1
## Agah01_H2ZDNKJ02G8AL6   1433     20      2
## Agah01_H0COW3M02HMT2E      0      0      0

summaryCM(cm)

##                        total
## reads                202,430
## OTUs                   2,682
## samples                   59
## singletons               939
## doubletons               353
## max.OTU.abundance     48,440
## min.OTU.abundance          1
## max.sample.abundance  10,612
## min.sample.abundance   1,054

Taxonomic classifier

The taxonomy of OTUs is first classified by the RDP classifier using the latest version of Greengenes database (May 2013). And then, the second taxonomy of same OTUs is classified by BLAST and interpreted by MEGAN.

The table of taxonomy from kingdom to genus is uploaded as below, where MEGAN repeats the last classified taxonomy at the higher rank through the lineage for all unclassified taxonomy, but RDP keeps them in blank.

BLAST + MEGAN

Note: Uncultured Bacteria (taxid:77133) is excluded during BLAST in order to give more accurate taxonomic classification.

The classification without excluding Uncultured Bacteria provides much worse result for this 16S dataset.

#BLAST
tt.megan <- readTaxaTable("../data-raw/miyake.454.16s.megan.txt", "16S megan taxa table", taxa.group="all")

## 
## Upload 16S megan taxa table taxonomy table : 7 columns excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.megan.txt

# 1st column is taxonomic lineage path
tt.megan[1:3,-1]

##                        kingdom         phylum               class
## Agah01_H0COW3M02F0MYH Bacteria     Firmicutes          Firmicutes
## Agah01_H0COW3M02F0SWP     root           root                root
## Agah01_H0COW3M02F1ZMK Bacteria Proteobacteria Gammaproteobacteria
##                                     order              family
## Agah01_H0COW3M02F0MYH          Firmicutes          Firmicutes
## Agah01_H0COW3M02F0SWP                root                root
## Agah01_H0COW3M02F1ZMK Gammaproteobacteria Gammaproteobacteria
##                                     genus
## Agah01_H0COW3M02F0MYH          Firmicutes
## Agah01_H0COW3M02F0SWP                root
## Agah01_H0COW3M02F1ZMK Gammaproteobacteria

RDP + Greengenes

# RDP
tt.rdp <- readTaxaTable("../data-raw/miyake.454.16s.greengenes.txt", "16S rdp taxa table", taxa.group="all")

## 
## Upload 16S rdp taxa table taxonomy table : 9 columns excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.greengenes.txt

# 1st column is taxonomic lineage path
tt.rdp[1:3,-1]

##                       confidence  kingdom         phylum
## Agah01_H0COW3M02F0MYH       0.53 Bacteria     Firmicutes
## Agah01_H0COW3M02F0SWP       0.79 Bacteria            TM7
## Agah01_H0COW3M02F1ZMK       0.91 Bacteria Proteobacteria
##                                     class         order          family
## Agah01_H0COW3M02F0MYH          Clostridia Clostridiales Lachnospiraceae
## Agah01_H0COW3M02F0SWP                                                  
## Agah01_H0COW3M02F1ZMK Gammaproteobacteria Legionellales                
##                             genus species
## Agah01_H0COW3M02F0MYH Coprococcus        
## Agah01_H0COW3M02F0SWP                    
## Agah01_H0COW3M02F1ZMK

Taxonomic assignments

The number of reads of OTUs assigned to different taxonomies is summarised below regarding to classifiers. Because this dataset is preprossed, RDP using 0.5 confidence threshold do not have any unclassified taxonomy.

# return a list of taxa assignments, names = c("megan", "rdp.5", "rdp.8")
getTaxaAssignList <- function(cm, tt.megan, tt.rdp, taxa.group="all", rank="kingdom") {
    cm.taxa <- mergeCMTaxa(cm, tt.megan)
    ta.megan <- assignTaxaByRank(cm.taxa)
    sum(ta.megan[[rank]])

    cm.taxa <- mergeCMTaxa(cm, tt.rdp, classifier="RDP")
    ta.rdp.8 <- assignTaxaByRank(cm.taxa)
    sum(ta.rdp.8[[rank]])

    cm.taxa <- mergeCMTaxa(cm, tt.rdp, classifier="RDP", min.conf=0.5)
    ta.rdp.5 <- assignTaxaByRank(cm.taxa)
    sum(ta.rdp.5[[rank]])

    list(megan=ta.megan, rdp.5=ta.rdp.5, rdp.8=ta.rdp.8)
}

getReads <- function(ta.list, rank="kingdom") {
    reads.megan <- ta.list[["megan"]][[rank]]
    reads.megan$classifier <- "MEGAN"
    reads.megan[,rank] <- rownames(reads.megan)
    reads.rdp.8 <- ta.list[["rdp.8"]][[rank]]
    reads.rdp.8$classifier <- "RDP 0.8"
    reads.rdp.8[,rank] <- rownames(reads.rdp.8)
    reads.rdp.5 <- ta.list[["rdp.5"]][[rank]]
    reads.rdp.5$classifier <- "RDP 0.5"
    reads.rdp.5[,rank] <- rownames(reads.rdp.5)

    reads <- rbind(reads.megan, reads.rdp.8, reads.rdp.5)
}

Assign taxonomy at different rank level

ta.list <- getTaxaAssignList(cm, tt.megan, tt.rdp)

## Parse classification from MEGAN classifier , and preprocess taxanomy names. 
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Merge 2682 rows in community matrix with 2682 rows in taxa table, get 2682 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Parse classification from RDP classifier , and preprocess taxanomy names. 
## Set 914 rows as 'unclassified' from the total of 2682 in RDP taxa table, whose confidence < 0.8 .
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Merge 2682 rows in community matrix with 2682 rows in taxa table, get 2682 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Parse classification from RDP classifier , and preprocess taxanomy names. 
## Set 0 rows as 'unclassified' from the total of 2682 in RDP taxa table, whose confidence < 0.5 .
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Merge 2682 rows in community matrix with 2682 rows in taxa table, get 2682 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family

names(ta.list[["megan"]])

## [1] "kingdom" "phylum"  "class"   "order"   "family"

ta.list[["megan"]][["kingdom"]]

##               total
## Bacteria     194527
## Eukaryota        62
## unclassified   7841

ta.list[["rdp.8"]][["phylum"]]

##                        total
## Acidobacteria             87
## Actinobacteria          6761
## Bacteroidetes           3860
## Chlamydiae                 8
## Chloroflexi               13
## Cyanobacteria           2376
## Deferribacteres           15
## Elusimicrobia             10
## Fibrobacteres              1
## Firmicutes            120806
## Fusobacteria            8619
## GN02                       1
## Lentisphaerae             22
## Planctomycetes             1
## Proteobacteria         18101
## SBR1093                   24
## Spirochaetes             441
## Synergistetes             71
## Tenericutes              250
## Thermi                     2
## TM6                      109
## TM7                       52
## Verrucomicrobia          569
## WPS-2                     17
## WS2                        1
## WS3                        1
## unclassified Bacteria     43
## unclassified           40169

Abundance percentage bar of RDP using 0.5 and 0.8 confidence threshold

ta <- merge(ta.list[["rdp.5"]][["phylum"]], ta.list[["rdp.8"]][["phylum"]], 
            by = "row.names", all = TRUE)
ta[is.na(ta)] <- 0
colnames(ta) <- c("phylum", "rdp.5", "rdp.8")

bar.per <- ggPercentageBarChart(ta, melt.id="phylum", verbose=F, title="Abundance Percentage Bar")
bar.per$gg.plot

Abundance summary

grid_arrange_shared_legend shares a legend between multiple plots, and returns a gtable object:

kingdom <- getReads(ta.list)
kingdom

##                total classifier      kingdom
## Bacteria      194527      MEGAN     Bacteria
## Eukaryota         62      MEGAN    Eukaryota
## unclassified    7841      MEGAN unclassified
## Bacteria1     162261    RDP 0.8     Bacteria
## unclassified1  40169    RDP 0.8 unclassified
## Bacteria2     202430    RDP 0.5     Bacteria

bar.kingdom <- ggBarChart(kingdom, x.id="kingdom", y.id="total",  
                        fill.id="classifier", y.trans="log", y.lab="reads", 
                        title="Classification Including Singletons", verbose=F)
bar.kingdom <- bar.kingdom + ggAddNumbers(label.id="total", text.size=3, 
                                          hjust=0.5, vjust=2)
bar.kingdom

Bacteria community without singleton

# no singleton
cm.min2 <- readCommunityMatrix("../data-raw/miyake.454.16s.otus.csv", "16S")

## 
## Upload 16S community matrix : 59 samples excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.otus.csv 
## Remove 939 rows whose minimum abundance < 2 !

summaryCM(cm.min2)

##                        total
## reads                201,491
## OTUs                   1,743
## samples                   59
## singletons                 0
## doubletons               353
## max.OTU.abundance     48,440
## min.OTU.abundance          2
## max.sample.abundance  10,520
## min.sample.abundance   1,054

# no singleton
ta.list <- getTaxaAssignList(cm.min2, tt.megan, tt.rdp)

## Parse classification from MEGAN classifier , and preprocess taxanomy names. 
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Merge 1743 rows in community matrix with 2682 rows in taxa table, get 1743 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Parse classification from RDP classifier , and preprocess taxanomy names. 
## Set 914 rows as 'unclassified' from the total of 2682 in RDP taxa table, whose confidence < 0.8 .
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Merge 1743 rows in community matrix with 2682 rows in taxa table, get 1743 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Parse classification from RDP classifier , and preprocess taxanomy names. 
## Set 0 rows as 'unclassified' from the total of 2682 in RDP taxa table, whose confidence < 0.5 .
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family 
## Merge 1743 rows in community matrix with 2682 rows in taxa table, get 1743 classified OTUs.
## Preprocess taxonomy names at rank columns : kingdom,phylum,class,order,family

kingdom.min2 <- getReads(ta.list)
kingdom.min2

##                total classifier      kingdom
## Bacteria      193839      MEGAN     Bacteria
## Eukaryota         49      MEGAN    Eukaryota
## unclassified    7603      MEGAN unclassified
## Bacteria1     161650    RDP 0.8     Bacteria
## unclassified1  39841    RDP 0.8 unclassified
## Bacteria2     201491    RDP 0.5     Bacteria

merge.kingdom <- merge(kingdom, kingdom.min2, by=c("kingdom", "classifier"))
# informal code to select Bacteria
bacteria <- merge.kingdom[merge.kingdom=="Bacteria",]
bacteria$percentage <- bacteria$total.y / bacteria$total.x
bacteria$singletons <- 1 - bacteria$percentage
bacteria

##    kingdom classifier total.x total.y percentage  singletons
## 1 Bacteria      MEGAN  194527  193839  0.9964632 0.003536784
## 2 Bacteria    RDP 0.5  202430  201491  0.9953614 0.004638641
## 3 Bacteria    RDP 0.8  162261  161650  0.9962345 0.003765538

ggBarChart(bacteria, x.id="classifier", y.id="singletons", y.trans="per", 
           title="Percentage of Singletons Classified as Bacteria", verbose=F)

Classification comparison

# select Bacteria
tt.megan <- readTaxaTable("../data-raw/miyake.454.16s.megan.txt", "16S megan taxa table", taxa.group="Bacteria")

## 
## Upload 16S megan taxa table taxonomy table : 7 columns excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.megan.txt 
## Select 2069 classifications, by given taxa.group = Bacteria , rank = kingdom , include = TRUE .

tt.rdp <- readTaxaTable("../data-raw/miyake.454.16s.greengenes.txt", "16S rdp taxa table", taxa.group="Bacteria")

## 
## Upload 16S rdp taxa table taxonomy table : 9 columns excluding row names from column 1, 2682 OTUs excluding column names from row 1, from file ../data-raw/miyake.454.16s.greengenes.txt

n.taxa.df <- data.frame(row.names=c("MEGAN", "RDP 0.5", "RDP 0.8"))
reads.df <- data.frame(row.names=c("MEGAN", "RDP 0.5", "RDP 0.8"))
r <- 1
for (ta.name in c(c("megan", "rdp.5", "rdp.8"))) {
    print(paste("Classifier :", ta.name))
    ta.df <- ta.list[[ta.name]]
    # start from phylum
    for (ra in names(ta.df)[-1]) {
        ta <- ta.df[[ra]]
        n.uncl <- length(grep("unclassified", rownames(ta), ignore.case = T))    
        print(paste(ra, ": total taxa =", nrow(ta), ", where", 
                    n.uncl, "unclassified taxa."))    
        # not count "unclassified"
        n.taxa.df[r, ra] <- nrow(ta) - n.uncl

        ta <- subset(ta, !grepl("unclassified", rownames(ta), ignore.case = T))
        reads.df[r, ra] <- sum(ta[,"total"])
    }
    r <- r + 1
}

## [1] "Classifier : megan"
## [1] "phylum : total taxa = 21 , where 3 unclassified taxa."
## [1] "class : total taxa = 53 , where 17 unclassified taxa."
## [1] "order : total taxa = 82 , where 28 unclassified taxa."
## [1] "family : total taxa = 97 , where 38 unclassified taxa."
## [1] "Classifier : rdp.5"
## [1] "phylum : total taxa = 22 , where 1 unclassified taxa."
## [1] "class : total taxa = 61 , where 14 unclassified taxa."
## [1] "order : total taxa = 102 , where 22 unclassified taxa."
## [1] "family : total taxa = 164 , where 39 unclassified taxa."
## [1] "Classifier : rdp.8"
## [1] "phylum : total taxa = 23 , where 2 unclassified taxa."
## [1] "class : total taxa = 59 , where 14 unclassified taxa."
## [1] "order : total taxa = 99 , where 22 unclassified taxa."
## [1] "family : total taxa = 146 , where 35 unclassified taxa."

Classified taxonomy under Bacteria (no singletons)

# result
n.taxa.df

##         phylum class order family
## MEGAN       18    36    54     59
## RDP 0.5     21    47    80    125
## RDP 0.8     21    45    77    111

n.taxa.df$id = rownames(n.taxa.df)
n.taxa.melt <- melt(n.taxa.df)

## Using id as id variables

ggBarChart(n.taxa.melt, x.id="variable", y.id="value", fill.id="id", 
           x.lab="rank", y.lab="classified taxonomy",
           title="The Number Of Classified Taxonomy Under Bacteria (No Singletons)",
           legend.title="classifier", verbose=F) + 
ggAddNumbers(label.id="value", text.size=3, hjust=0.5, vjust=2)

Reads of classified OTUs under Bacteria (no singletons)

# result
reads.df

##         phylum  class  order family
## MEGAN   139899 117891  41771  22989
## RDP 0.5 201458 189928 188391 178640
## RDP 0.8 161617 154821 154180 146210

reads.df$id = rownames(reads.df)
reads.melt <- melt(reads.df)

## Using id as id variables

ggBarChart(reads.melt, x.id="variable", y.id="value", fill.id="id", 
           x.lab="rank", y.lab="reads of classified OTUs", 
           title="Reads Of Classified OTUs Under Bacteria (No Singletons)",
           legend.title="classifier", verbose=F) + 
ggAddNumbers(label.id="value", text.size=3, hjust=0.5, vjust=2)

Conclusion

Excluding uncultured bacteria (taxid:77133) during BLAST will increase the accuracy of taxonomic classification for 16S dataset.
RDP using 0.5 confidence threshold picks up more singletons than RDP 0.8 and MEGAN, if singletone are not removed.
RDP + Greengenes can detect more or similar number of taxonomies than MEGAN before down to family level; at the family level the comparison is varied according to RDP confidence threshold; below family RDP suddenly loses its power, which I presume it is caused by the lack of reference below family level.
RDP using 0.5 confidence threshold always keeps more reads from classified OTUs than MEGAN, but RDP 0.8 has a big gap to reach their abundance above family level; at the family level the reads classified by MEGAN is significately dropped, this mean many abundant OTUs are unable to assign the classification at family level; the same drops for RDP happen below family level.

Take-home message

RDP + Greengenes using 0.8 confidence threshold is recommanded to make the taxonomic classification for 16S dataset. But dropping confidence threshold with concerns could increase the abundance of classified taxonomy above genus.

RDP + Greengenes vs. BLAST + MEGAN

Walter Xie

25/05/2016