In ecology, scientists collect quantitative data to determine the resemblance between either the objects under study (objects) or the variables describing them (species or other descriptors). The analysis of the association between objects and descriptors is the first step in the analysis.
There are three main types of association:
After creating these measures of association, common tasks are to cluster or to ordinate.
To cluster: Sets of objects (or descriptors) can be partitioned into two or more subsets (clusters) using pre-established rules of agglomeration.
To ordinate: Objects (or descriptors) are represented in a space containing fewer dimensions than the original dataset such that the positions of the objects (or descriptors) with respect to one another may also be used to cluster them.
As scientists interested in the microbiome, we are now all microbial ecologists. We typically talk about objects as subjects from whom samples have been taken (or even sites within subjects, e.g., oral, vaginal, rectal) while the descriptors are the list of abundances of each species (or other taxonomic classification) found.
So we become interested in how similar subjects are to each other, say, within group i, and how distant they are from, say, group j. Then we might want to know if that distance is associated with some other difference between the groups, for instance, are groups that are closer together in site pH levels also more similar (i.e., less distant).
Other questions of interest: how do sites (or species) cluster together? Also, can you use a multivariate dimensionality reduction technique to describe differences while preserving distances?
Recall if it is appropriate to determine distances between two points with Euclidean distance, the ensuing multidimensional reduction technique is principal component analysis (PCA). If the data are counts then the distance should be calculated in terms of the \(\chi^2\) distance, and then the reduction technique is canonical component analysis (CCA). If the distance metrics take on some other value (e.g., dissimilarity), then principal coordinate analysis (PCoA), which is really PCA of the squared distance matrix.
These are the types of questions that we can use phyloseq
to address with our OTUtables and our sample data.
phyloseq
is an R package for efficient interactive and reproducible analyses of OTU-clustered high-throughput phylogenetic sequencing data. We can take the sequence variant table produced by dada2
and use it, along with taxonomic classification and data about the samples, in these analyses. Phyloseq
imports, stores, analyzes, and displays these data.
In our last lesson we showed how to hand off data to phyloseq
from dada2
, using the phyloseq
command to create a phyloseq
object from an OTU table, a taxonomic table, and a sample dataset.
In this lesson we will some of the other functionality of phyloseq
:
phyloseq
objects from the output of other programs;To get started we will need to load both phyloseq
and ggplot2
.
# Load packages
library(phyloseq)
packageVersion("phyloseq")
## [1] '1.22.3'
library(ggplot2)
packageVersion("ggplot2")
## [1] '2.2.1'
library(RColorBrewer)
packageVersion("RColorBrewer")
## [1] '1.1.2'
Load example data using the data()
command.
# Load example data
data(GlobalPatterns)
data(esophagus)
data(enterotype)
data(soilrep)
You can try these examples using the example()
command.
# Use the example() command on the enterotype database
example(enterotype, ask=FALSE)
##
## entrty> data(enterotype)
##
## entrty> ig <- make_network(enterotype, "samples", max.dist=0.3)
##
## entrty> plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)
## Warning: attributes are not identical across measure variables; they will
## be dropped
You can also import data from other packages such as MG-RAST or QIIME. The newer versions of QIIME produce files formatted according to the biom standard (see http://biom-format.org).
The phyloseq package has small examples of biom files with different levels and organization of data. The following chunk illustrates how to import each of the 4 main types of biom files. Note that in practice you won’t know what type your file is, so best to include all 4. In addition, the import_biom
funciton allows you to import an associated phylogenetic tree file and reference sequence file (e.g., fasta file).
To do this you must first define the file paths. For this example, this should be within the phyloseq package. Thus we use the system.file
command to get the paths.
# Load the files
rich_dense_biom <- system.file("extdata", "rich_dense_otu_table.biom", package="phyloseq")
rich_sparse_biom <- system.file("extdata", "rich_sparse_otu_table.biom", package="phyloseq")
min_dense_biom <- system.file("extdata", "min_dense_otu_table.biom", package="phyloseq")
min_sparse_biom <- system.file("extdata", "min_sparse_otu_table.biom", package="phyloseq")
treefilename <- system.file("extdata", "biom-tree.phy", package="phyloseq")
refseqfilename <- system.file("extdata", "biom-refseq.fasta", package="phyloseq")
Next we use these paths as argments to the import_biom
function. The tree and reference sequence files are both suitable for any of the biom file types, so we only need one path for each.
# Import via the paths
import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## sample_data() Sample Data: [ 6 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 5 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq() DNAStringSet: [ 5 reference sequences ]
import_biom(rich_sparse_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## sample_data() Sample Data: [ 6 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 5 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq() DNAStringSet: [ 5 reference sequences ]
import_biom(min_dense_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## phy_tree() Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq() DNAStringSet: [ 5 reference sequences ]
import_biom(min_sparse_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## phy_tree() Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq() DNAStringSet: [ 5 reference sequences ]
In practice you will store the result of your import as some variable name, like myData
, then use this object in downstream analyses. For examples,
# Example of storing results of import and manipulating it
myData <- import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction = parse_taxonomy_greengenes)
myData
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5 taxa and 6 samples ]
## sample_data() Sample Data: [ 6 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 5 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq() DNAStringSet: [ 5 reference sequences ]
plot_tree(myData, color = "Genus", shape = "BODY_SITE", size = "abundance")
plot_richness(myData, x="BODY_SITE", color="Description")
## Warning: Removed 33 rows containing missing values (geom_errorbar).
plot_bar(myData, fill="Genus")
refseq(myData)
## A DNAStringSet instance of length 5
## width seq names
## [1] 334 AACGTAGGTCACAAGCGTTGT...TTCCGTGCCGGAGTTAACAC GG_OTU_1
## [2] 465 TACGTAGGGAGCAAGCGTTAT...CCTTACCAGGGCTTGACATA GG_OTU_2
## [3] 249 TACGTAGGGGGCAAGCGTTAT...GGCTCGAAAGCGTGGGGAGC GG_OTU_3
## [4] 453 TACGTATGGTGCAAGCGTTAT...AAGCAACGCGAAGAACCTTA GG_OTU_4
## [5] 178 AACGTAGGGTGCAAGCGTTGT...GGAATGCGTAGATATCGGGA GG_OTU_5
The good people at phyloseq
have also downloaded data from the Human Microbiome Project and made it available as an R dataset. So let’s look at it
# Get the HMP data
# You will need to change your path
load("~/Downloads/HMPv35.RData")
HMPv35
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 45336 taxa and 4743 samples ]
## sample_data() Sample Data: [ 4743 samples by 9 sample variables ]
## tax_table() Taxonomy Table: [ 45336 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 45336 tips and 45099 internal nodes ]
## refseq() DNAStringSet: [ 45336 reference sequences ]
phyloseq
supports two types of merging.
merge_samples()
or merge_taxa()
commands.merge_phyloseq()
commend.Merging samples with merge_samples()
can be a useful means of reducing noise or excess features, say by removing the individual effects between replicates or between samples for a particular explanatory variable. With merge_samples()
the abundance values for the merged samples are summed.
As an example, let’s first remove unobserved OTUs (those that sum 0 across all samples) and add an explanatory variable with which to organize later in plots, using the GlobalPatterns dataset.
# Remove empty samples
GP <- GlobalPatterns
GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
humantypes <- c("Feces", "Mock", "Skin", "Tongue")
sample_data(GP)$human <- get_variable(GP, "SampleType") %in% humantypes
Now on to merging samples:
# Merge Samples
mergedGP <- merge_samples(GP, "SampleType")
SD <- merge_samples(sample_data(GP), "SampleType")
print(SD[, "SampleType"])
## SampleType
## Feces 1
## Freshwater 2
## Freshwater (creek) 3
## Mock 4
## Ocean 5
## Sediment (estuary) 6
## Skin 7
## Soil 8
## Tongue 9
print(mergedGP)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 18988 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 8 sample variables ]
## tax_table() Taxonomy Table: [ 18988 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 18988 tips and 18987 internal nodes ]
sample_names(GP)
## [1] "CL3" "CC1" "SV1" "M31Fcsw" "M11Fcsw" "M31Plmr"
## [7] "M11Plmr" "F21Plmr" "M31Tong" "M11Tong" "LMEpi24M" "SLEpi20M"
## [13] "AQC1cm" "AQC4cm" "AQC7cm" "NP2" "NP3" "NP5"
## [19] "TRRsed1" "TRRsed2" "TRRsed3" "TS28" "TS29" "Even1"
## [25] "Even2" "Even3"
sample_names(mergedGP)
## [1] "Feces" "Freshwater" "Freshwater (creek)"
## [4] "Mock" "Ocean" "Sediment (estuary)"
## [7] "Skin" "Soil" "Tongue"
identical(SD, sample_data(mergedGP))
## [1] TRUE
The OTU abundances of merged samples are summed. Let’s investigate this ourselves looking at just the top 10 most abundant OTUs.
# Look at top 10 most abundant OTUs
OTUnames10 <- names(sort(taxa_sums(GP), TRUE)[1:10])
GP10 <- prune_taxa(OTUnames10, GP)
mGP10 <- prune_taxa(OTUnames10,mergedGP)
ocean_samples <- sample_names(subset(sample_data(GP), SampleType == "Ocean"))
print(ocean_samples)
## [1] "NP2" "NP3" "NP5"
otu_table(GP10)[, ocean_samples]
## OTU Table: [10 taxa and 3 samples]
## taxa are rows
## NP2 NP3 NP5
## 329744 91 126 120
## 317182 3148 12370 63084
## 549656 5045 10713 1784
## 279599 113 114 126
## 360229 16 83 786
## 94166 49 128 709
## 550960 11 86 65
## 158660 13 39 28
## 331820 24 101 105
## 189047 4 33 29
rowSums(otu_table(GP10)[, ocean_samples])
## 329744 317182 549656 279599 360229 94166 550960 158660 331820 189047
## 337 78602 17542 353 885 886 162 80 230 66
otu_table(mGP10)["Ocean", ]
## OTU Table: [10 taxa and 1 samples]
## taxa are columns
## 329744 317182 549656 279599 360229 94166 550960 158660 331820 189047
## Ocean 337 78602 17542 353 885 886 162 80 230 66
Now let’s look at the merge graphically between two richnes estimate summary plots.
# Richness estimate plots
plot_richness(GP, "human", "SampleType", title="Unmerged")
## Warning: Removed 130 rows containing missing values (geom_errorbar).
The merge can do some weird things to samples variables, so let’s re-add these variables to the sample_data
before we plot.
# Re-add sample data and plot richness plots again.
sample_data(mergedGP)$SampleType <- sample_names(mergedGP)
sample_data(mergedGP)$human <- sample_names(mergedGP) %in% humantypes
plot_richness(mergedGP, "human", "SampleType", title="merged")
## Warning: Removed 45 rows containing missing values (geom_errorbar).
When we combine the abundances of non-replicate samples from the same environment, the estimates of absolute richness increase for each environment. But, more to the point, the new plot is easier to read and interpret, one of the reasons one might use the merge_samples
function.
One of the issues with microbial census data is that a fine-scaled definition for OTU may blur a pattern that might otherwise be evident if we considered a higher taxonomic rank. The best way to deal with this is to agglomerate phylogenetic tree tips or taxa by using the agglomeration functions tip_glom
or tax_glom
. Let’s use the object we imported earlier from the biom format files, myData. First the original tree:
# Plot phylogenetic tree of GlobalPatterns object
plot_tree(GP, color="SampleType", sizebase=2, label.tips="taxa_name")
What a mess. Let’s try using merge_taxa
to merge the first 100 OTUs into one new OTU. The option 2 in the call means to combine the counts of these 100 OTUs into the index for the new OTU.
# Combine the tree tips and plot again
x1 <- merge_taxa(GP, taxa_names(GP)[1:100], 2)
plot_tree(x1, color="SampleType", sizebase=2, label.tips="taxa_name")
You can barely see the difference, but it is there.
You can also merge phyloseq objects. We will demonstrate with a somewhat trivial example by extracting the components of the GlobalPatterns object and then building them back to the original form using the command merge_phyloseq
.
# Split apart GlobalPatterns
data(GlobalPatterns)
tree <- phy_tree(GlobalPatterns)
tax <- tax_table(GlobalPatterns)
otu <- otu_table(GlobalPatterns)
sam <- sample_data(GlobalPatterns)
# Create new phyloseq object
otutax <- phyloseq(otu, tax)
otutax
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19216 taxa and 26 samples ]
## tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
You can se that our new otutax
object has just the OTU table and the taxonomy table. Let’s use merge_phyloseq
to build up the original GlobalPatterns
object, and compare to make sure they are identical.
# Build back up
GP2 = merge_phyloseq(otutax, sam, tree)
# Test for identity
identical(GP2, GlobalPatterns)
## [1] TRUE
We have already seen a bit above about how to access some of the data within a phyloseq
object. We will comprehensively cover it now.
Let’s use the GlobalPatterns
dataset.
# Load data object and see what is there
data("GlobalPatterns")
GlobalPatterns
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19216 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
ntaxa(GlobalPatterns)
## [1] 19216
nsamples(GlobalPatterns)
## [1] 26
sample_names(GlobalPatterns)[1:5]
## [1] "CL3" "CC1" "SV1" "M31Fcsw" "M11Fcsw"
rank_names(GlobalPatterns)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
sample_variables(GlobalPatterns)
## [1] "X.SampleID" "Primer"
## [3] "Final_Barcode" "Barcode_truncated_plus_T"
## [5] "Barcode_full_length" "SampleType"
## [7] "Description"
otu_table(GlobalPatterns)[1:5, 1:5]
## OTU Table: [5 taxa and 5 samples]
## taxa are rows
## CL3 CC1 SV1 M31Fcsw M11Fcsw
## 549322 0 0 0 0 0
## 522457 0 0 0 0 0
## 951 0 0 0 0 0
## 244423 0 0 0 0 0
## 586076 0 0 0 0 0
tax_table(GlobalPatterns)[1:5, 1:4]
## Taxonomy Table: [5 taxa by 4 taxonomic ranks]:
## Kingdom Phylum Class Order
## 549322 "Archaea" "Crenarchaeota" "Thermoprotei" NA
## 522457 "Archaea" "Crenarchaeota" "Thermoprotei" NA
## 951 "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales"
## 244423 "Archaea" "Crenarchaeota" "Sd-NA" NA
## 586076 "Archaea" "Crenarchaeota" "Sd-NA" NA
phy_tree(GlobalPatterns)
##
## Phylogenetic tree with 19216 tips and 19215 internal nodes.
##
## Tip labels:
## 549322, 522457, 951, 244423, 586076, 246140, ...
## Node labels:
## , 0.858.4, 1.000.154, 0.764.3, 0.995.2, 1.000.2, ...
##
## Rooted; includes branch lengths.
taxa_names(GlobalPatterns)[1:10]
## [1] "549322" "522457" "951" "244423" "586076" "246140" "143239"
## [8] "244960" "255340" "144887"
myTaxa <- names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE)[1:10])
ex1 <- prune_taxa(myTaxa, GlobalPatterns)
plot(phy_tree(ex1), show.node.label = TRUE)
plot_tree(ex1, color = "SampleType", label.tips = "Phylum", ladderize = "left", justify = "left", size = "Abundance")
phyloseq
also includes functions for filtering, subsetting, and merging abundance data. Let’s filter!
Filtering is designed in a modular fashion. The functions prune_taxa
and prune_samples
will directly remove unwanted indices. The functions filterfun_sample
and genefilter_sample
will allow you to build arbitrarily complex sample-wise filtering criteria. The function filter_taxa
allows for taxa-wise filtering.
In the following example we will use the GlobalPatterns object again, first transforming it to relative abundance in a new GPr
object, then filtering that object to remove all OTUs that have a mean abundance < \(10^-5\), creating another new object.
# Create the object with the relative abundance data
GPr <- transform_sample_counts(GlobalPatterns, function(x) x / sum(x))
GPfr <- filter_taxa(GPr, function(x) mean(x) > 1e-5, TRUE)
GlobalPatterns
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19216 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
GPr
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19216 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
GPfr
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 4624 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 4624 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 4624 tips and 4623 internal nodes ]
We see that the final object, GPfr
is highly subsetted containing just a little less than a quarter of the original OTUs (4624 / 19216).
The functions prune_sample
and prune_taxa
are to be used in cases where the complete subset of OTUS or samples is directly available. Alternatively the subsets_taxa
and subset_samples
are best for subsetting based on other data contained in the taxonomy and sample datasets respectively.
As an example we will obtain the subset of the GlobalPatterns dataset that is part of the Chlamydiae phylum, then remove samples with less than 20 reads.
# Subset for Chlamydiae
GP.ch1 <- subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
# Eliminate samples whose total reads are less than 20
GP.ch1 <- prune_samples(sample_sums(GP.ch1) >= 20, GP.ch1)
GP.ch1
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 21 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 21 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 21 tips and 20 internal nodes ]
plot_tree(GP.ch1)
Merging methods include merge_taxa
and merge_samples
for merging specific OTUs or samples, respectively.
The following chunk shows the merge of the first 5 OTUs in teh Chlamydiae-only dataset.
# Merge first 5 OTUSs in Chlamydiae-only dataset
GP.ch1.merged <- merge_taxa(GP.ch1, taxa_names(GP.ch1)[1:5])
Let’s do one final set of preprocessing steps
# Final preprocess for GlobalPatterns: filter out taxa not see more than 3 times in at least
# 20% of samples
GP <- filter_taxa(GlobalPatterns, function(x) sum(x > 3) > (0.2*length(x)), TRUE)
# Define human v non-human and add to sample data:
sample_data(GP)$human <- factor(get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue"))
# Standardize abundances to median sequencing depth
total <- median(sample_sums(GP))
standf <- function(x, t=total) round(t * (x / sum(x)))
gps <- transform_sample_counts(GP, standf)
# Filter out taxa with CV > 3.0
gpsf <- filter_taxa(gps, function(x) sd(x) / mean(x) > 3.0, TRUE)
# Subset to Bacteroidetes
gpsfb <- subset_taxa(gpsf, Phylum == "Bacteroidetes")
Let’s graph this slice of data
# Graph the slice of data
# First a simple bar graph
title = "plot_bar; Bacteroidetes-only"
plot_bar(gpsfb, "SampleType", "Abundance", title = title)
# Now colored by family
plot_bar(gpsfb, "SampleType", "Abundance", "Family", title = title)
# Now faceted with type of samples
plot_bar(gpsfb, "Family", "Abundance", "Family", title = title, facet_grid = "SampleType~.")
You can also define your own graphics “from scratch” using psmelt
Here are two examples:
# Example 1
mdf <- psmelt(gpsfb)
# Simple bar plot
ggplot(mdf, aes(x=SampleType,
y=Abundance)) +
geom_bar(stat="identity", position = "stack", color="black")
# Simple heat map
ggplot(mdf, aes(x=SampleType,
y=OTU,
fill=Abundance)) +
geom_raster()
The distance
function takes a phyloseq
object and a method and returns a dist
-class distance object. Currently phyloseq
supports 44 different method options. For the complete list, type distanceMethodList
. Use the documentation on distance
for further details.
The usage is as follows:
distance(phyloseq-object, method=“methodname”, type=“type”)
Because this function organizes distance calculations into one function, it is easy to calculate all distance methods and investigate the results. We are going to do so on the Enterotypes
dataset, then perform multidimensional scaling, plot the first two axes, shading and shaping the points in each plot according to sequencing technology and assigned “Enterotype” label.
First some preliminaries, loading plyr
, setting the ggplot2
theme, and loading the dataset:
# get plyr
library(plyr)
# set ggplot2 theme
theme_set(theme_bw())
# load enterotype data
data(enterotype)
Some preliminary filtering to remove OTUs that have unassigned sequences (“-1”):
# Remove unassigned OTUs
enterotype <- subset_taxa(enterotype, Genus != "-1")
Let’s see the available distance methods coded in distance
:
dist_methods <- unlist(distanceMethodList)
print(dist_methods)
## UniFrac1 UniFrac2 DPCoA JSD vegdist1
## "unifrac" "wunifrac" "dpcoa" "jsd" "manhattan"
## vegdist2 vegdist3 vegdist4 vegdist5 vegdist6
## "euclidean" "canberra" "bray" "kulczynski" "jaccard"
## vegdist7 vegdist8 vegdist9 vegdist10 vegdist11
## "gower" "altGower" "morisita" "horn" "mountford"
## vegdist12 vegdist13 vegdist14 vegdist15 betadiver1
## "raup" "binomial" "chao" "cao" "w"
## betadiver2 betadiver3 betadiver4 betadiver5 betadiver6
## "-1" "c" "wb" "r" "I"
## betadiver7 betadiver8 betadiver9 betadiver10 betadiver11
## "e" "t" "me" "j" "sor"
## betadiver12 betadiver13 betadiver14 betadiver15 betadiver16
## "m" "-2" "co" "cc" "g"
## betadiver17 betadiver18 betadiver19 betadiver20 betadiver21
## "-3" "l" "19" "hk" "rlb"
## betadiver22 betadiver23 betadiver24 dist1 dist2
## "sim" "gl" "z" "maximum" "binary"
## dist3 designdist
## "minkowski" "ANY"
Remove distance methods that depend on a phylogenetic tree, because the enterotype
object that we started with does not have one. These would be the two unifrac methods and DPCoA.
# Remove methods that need a phylogenetic tree
dist_methods[(1:3)]
## UniFrac1 UniFrac2 DPCoA
## "unifrac" "wunifrac" "dpcoa"
dist_methods <- dist_methods[-(1:3)]
print(dist_methods)
## JSD vegdist1 vegdist2 vegdist3 vegdist4
## "jsd" "manhattan" "euclidean" "canberra" "bray"
## vegdist5 vegdist6 vegdist7 vegdist8 vegdist9
## "kulczynski" "jaccard" "gower" "altGower" "morisita"
## vegdist10 vegdist11 vegdist12 vegdist13 vegdist14
## "horn" "mountford" "raup" "binomial" "chao"
## vegdist15 betadiver1 betadiver2 betadiver3 betadiver4
## "cao" "w" "-1" "c" "wb"
## betadiver5 betadiver6 betadiver7 betadiver8 betadiver9
## "r" "I" "e" "t" "me"
## betadiver10 betadiver11 betadiver12 betadiver13 betadiver14
## "j" "sor" "m" "-2" "co"
## betadiver15 betadiver16 betadiver17 betadiver18 betadiver19
## "cc" "g" "-3" "l" "19"
## betadiver20 betadiver21 betadiver22 betadiver23 betadiver24
## "hk" "rlb" "sim" "gl" "z"
## dist1 dist2 dist3 designdist
## "maximum" "binary" "minkowski" "ANY"
# This is the unser-defined method:
dist_methods["designdist"]
## designdist
## "ANY"
# Remove it
dist_methods <- dist_methods[-which(dist_methods == "ANY")]
print(dist_methods)
## JSD vegdist1 vegdist2 vegdist3 vegdist4
## "jsd" "manhattan" "euclidean" "canberra" "bray"
## vegdist5 vegdist6 vegdist7 vegdist8 vegdist9
## "kulczynski" "jaccard" "gower" "altGower" "morisita"
## vegdist10 vegdist11 vegdist12 vegdist13 vegdist14
## "horn" "mountford" "raup" "binomial" "chao"
## vegdist15 betadiver1 betadiver2 betadiver3 betadiver4
## "cao" "w" "-1" "c" "wb"
## betadiver5 betadiver6 betadiver7 betadiver8 betadiver9
## "r" "I" "e" "t" "me"
## betadiver10 betadiver11 betadiver12 betadiver13 betadiver14
## "j" "sor" "m" "-2" "co"
## betadiver15 betadiver16 betadiver17 betadiver18 betadiver19
## "cc" "g" "-3" "l" "19"
## betadiver20 betadiver21 betadiver22 betadiver23 betadiver24
## "hk" "rlb" "sim" "gl" "z"
## dist1 dist2 dist3
## "maximum" "binary" "minkowski"
Now we can loop through this list, calculate distances according to each methods, plot the distances, and save each plot to a plist (plist
).
# Prepare to receive plots
plist <- vector("list", length(dist_methods))
names(plist) <- dist_methods
for(i in dist_methods ){
# Calculate distance matrix
iDist <- distance(enterotype, method=i)
#Calculate ordination
iMDS <- ordinate(enterotype, "MDS", distance = iDist)
## Make plot
# Don't carry over previous plot (if errror, p will be blank)
p <- NULL
# Create plot, store as temp variable p
p <- plot_ordination(enterotype, iMDS, color="SeqTech", shape="Enterotype")
# Add title to each plot
p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
# Save the graphic to file
plist[[i]] <- p
}
## Warning in vegdist(structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, :
## results may be meaningless with non-integer data in method "morisita"
## Warning in vegdist(structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, :
## results may be meaningless with non-integer data in method "chao"
## Warning in vegdist(structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, :
## results may be meaningless with non-integer data in method "cao"
Shade according to sequencing technology:
# Shade according to sequencing technology:
df <- ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p <- ggplot(df, aes(Axis.1, Axis.2, color=SeqTech, shape=Enterotype))
p <- p + geom_point(size = 3, alpha = 0.5)
p <- p + facet_wrap(~distance, scales="free")
p <- p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p
## Warning: Removed 387 rows containing missing values (geom_point).
Now, shade instead according to assigned Enterotype
# Shade according to Enterotype:
df <- ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p <- ggplot(df, aes(Axis.1, Axis.2, color=Enterotype, shape=SeqTech))
p <- p + geom_point(size = 3, alpha = 0.5)
p <- p + facet_wrap(~distance, scales="free")
p <- p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p
Some selected examples among the created plots. First the Jensen-Shannon Divergence.
# Plot ordination with Jensen-Shannon Divergence
print(plist[["jsd"]])
Jaccard distance
# Plot ordination with Jaccard distance
print(plist[["jaccard"]])
Bray-Curtis
# Plot ordination with Bray-Curtis
print(plist[["bray"]])
Gower
# Plot ordination with Gower
print(plist[["gower"]])
w
# Plot ordination with w
print(plist[["w"]])
The clusGap
function from the cluster
package calculates a goodness of clustering measure, the “gap” statistic. For each number of clusters k
, it compares W(k) with \(E[W(k)]\) where the latter is defined by bootstrapping.
Let’s give this a try:
Well, really, first, let’s set things up, then ordinate:
# Set up
library(cluster)
packageVersion("cluster")
## [1] '2.0.6'
theme_set(theme_bw())
data(enterotype)
# Now ordinate
exord <- ordinate(enterotype, method = "MDS", distance = "jsd")
# Compute gap statistic
paml = function(x, k){list(cluster=pam(x,k, cluster.only = TRUE))}
x = phyloseq:::scores.pcoa(exord, distplay="sites")
gskmn = clusGap(x[,1:2], FUN=paml, K.max = 6, B=50)
gskmn
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = x[, 1:2], FUNcluster = paml, K.max = 6, B = 50)
## B=50 simulated reference sets, k = 1..6; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
## logW E.logW gap SE.sim
## [1,] 2.995599 3.113834 0.1182350 0.01318478
## [2,] 2.209852 2.765177 0.5553256 0.01801170
## [3,] 1.922188 2.585339 0.6631506 0.02467888
## [4,] 1.685798 2.413418 0.7276203 0.02790721
## [5,] 1.601025 2.281744 0.6807198 0.02357870
## [6,] 1.480640 2.184950 0.7043098 0.02517585
Define a plot method for results
# Define plot method
plot_glusgap <- function(clusgap, title="Gap Statistic calculation results"){
require("ggplot2")
gstab <- data.frame(clusgap$Tab, k=1:nrow(clusgap$Tab))
p <- ggplot(gstab, aes(k, gap)) + geom_line() + geom_point(size=5)
p <- p + geom_errorbar(aes(ymax = gap+SE.sim, ymin = gap - SE.sim))
p <- p + ggtitle(title)
return(p)
}
# Define a wrapper function
gap_statistic_ordination <- function(ord, FUNcluster, type="sites", K.max=6, axes=c(1:2), B=500, verbose=interactive(), ...){
require("cluster")
# If "paml" was chosen, use this internally defined call to pam
if(FUNcluster == "paml"){
FUNcluster = function(x,k) list(cluster = pam(x, k, cluster.only=TRUE))
}
# Use the scores function to get the ordination coordinates
x = phyloseq:::scores.pcoa(ord, display=type)
# If axes not explicitly defined (NULL), then use all of them
if(is.null(axes)){axes = 1:ncol(x)}
# Finally, perform, and return, the gap statistic calculation using cluster::clusGap
clusGap(x[, axes], FUN=FUNcluster, K.max=K.max, B=B, verbose=verbose, ...)
}
Now try out this function
# Make the plot
gs <- gap_statistic_ordination(exord, "paml", B=50, verbose = FALSE)
print(gs, method="Tibs2001SEmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = x[, axes], FUNcluster = FUNcluster, K.max = K.max, B = B, verbose = verbose)
## B=50 simulated reference sets, k = 1..6; spaceH0="scaledPCA"
## --> Number of clusters (method 'Tibs2001SEmax', SE.factor=1): 4
## logW E.logW gap SE.sim
## [1,] 2.995599 3.117218 0.1216190 0.02362545
## [2,] 2.209852 2.772969 0.5631176 0.02189187
## [3,] 1.922188 2.584820 0.6626322 0.02293463
## [4,] 1.685798 2.422742 0.7369441 0.02343545
## [5,] 1.601025 2.284250 0.6832254 0.02237319
## [6,] 1.480640 2.176227 0.6955870 0.02264802
plot_clusgap(gs)
For comparison, let’s see what happens if we only use plotting as available in base
R:
# Base R plotting
plot(gs, main = "Gap statistic for the 'Enterotypes' data")
mtext("Looks like 4 clusters is best, with 3 and 5 as close runners-up")
We want to filter low-occurrence, poorly-represented OTUs from these data, because they are essentially noise variables for purposes of this lesson. In practice, you should probably perform clearly-document well-justified pre-processing steps.
First we want to remove OTUs that do not appear more than 5 times in more than half the samples.
# Remove OTUs that don't appear more than 5 times in more than half the samples
GP <- GlobalPatterns
wh0 <- genefilter_sample(GP, filterfun_sample(function(x) x>5), A = 0.5*nsamples(GP))
GP1 <- prune_taxa(wh0, GP)
Transform to an even sampling depth.
# Transform to an even sampling depth
GP1 <- transform_sample_counts(GP1, function(x) 1e6 * x/sum(x))
Keep only the 5 most abundant phyla.
# Keep the top 5 most abundant phyla
phylum.sum <- tapply(taxa_sums(GP1), tax_table(GP1)[, "Phylum"], sum, na.rm = TRUE)
top5phyla <- names(sort(phylum.sum, TRUE))[1:5]
GP1 <- prune_taxa((tax_table(GP1)[, "Phylum"] %in% top5phyla), GP1)
summary(GP1)
## Length Class Mode
## 1 phyloseq S4
GP1
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 204 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 204 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 204 tips and 203 internal nodes ]
So we still have 204 taxa that are in the dataset GP1
that we will use going forward.
We will want to investigate human-associated microbiomes vs not, so we will define as follows:
# Define human-associate samples
human <- get_variable(GP1, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
sample_data(GP1)$human <- factor(human)
There are four basic representations of an ordination.
Let’s plot just the OTUs and shade by Phylum. Remember there will ntaxa(GP1)
= 204 OTUs in this plot.
# Plot ordination for just OTUs
GP.ord <- ordinate(GP1, "NMDS", "bray")
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1333468
## Run 1 stress 0.2536105
## Run 2 stress 0.1449079
## Run 3 stress 0.1570722
## Run 4 stress 0.1556499
## Run 5 stress 0.1465222
## Run 6 stress 0.1333468
## ... Procrustes: rmse 4.950622e-06 max resid 1.535051e-05
## ... Similar to previous best
## Run 7 stress 0.1521722
## Run 8 stress 0.1385323
## Run 9 stress 0.1637671
## Run 10 stress 0.1648288
## Run 11 stress 0.3713979
## Run 12 stress 0.1333468
## ... Procrustes: rmse 2.147202e-05 max resid 6.362179e-05
## ... Similar to previous best
## Run 13 stress 0.1385323
## Run 14 stress 0.1333469
## ... Procrustes: rmse 2.5098e-05 max resid 4.937225e-05
## ... Similar to previous best
## Run 15 stress 0.1542899
## Run 16 stress 0.151017
## Run 17 stress 0.1865419
## Run 18 stress 0.1529686
## Run 19 stress 0.1529609
## Run 20 stress 0.1529686
## *** Solution reached
p1 <- plot_ordination(GP1, GP.ord, type="taxa", color="Phylum", title = "taxa")
print(p1)
There is a lot going on here, and a lot of points are superimposed on one another. One way to deal with that is to use facetting.
# Facet-wrapping for previous plot
p1 + facet_wrap(~Phylum, 3)
Another way to plot is to consider the samples, and shade by “SampleType”, modifying shape according to if they are human associated.
# Plot just the Samples
p2 <- plot_ordination(GP1, GP.ord, type="samples", color="SampleType", shape="human")
p2 + geom_polygon(aes(fill=SampleType)) +
geom_point(size=5) +
ggtitle("samples")
The plot_ordination
function can also automatically create two different graphic layouts in which both the sampels and OTUs are plotted together in one “biplot”. This will not work for methods that are intrinsically samples-only ordinations (e.g., UniFrac/PCoA).
# Biplot Graph
p3 <- plot_ordination(GP1, GP.ord, type="biplot", color="SampleType", shape="Phylum", title = "biplot")
# The following will modify the automatic shape scale
GP1.shape.names <- get_taxa_unique(GP1, "Phylum")
GP1.shapes <- 15:(15 + length(GP1.shape.names) - 1)
names(GP1.shapes) <- GP1.shape.names
GP1.shapes["samples"] <- 16
p3 + scale_shape_manual(values=GP1.shapes)
We have the same problem as before, with a lot of overlay that might obscure meanint. Let’s try to separate into side-by-side panels.
# Split graphic into panels
p4 <- plot_ordination(GP1, GP.ord, type="split", color="Phylum", shape = "human", label = "SampleType", title = "split")
p4
Now we are going o loop through different method
parameter options to the plot_ordination
function, store the plot results, then plot in a combined graphic.
# First use different ordination methods and plot to list
dist <- "bray"
ord_meths <- c("DCA", "CCA", "RDA", "DPCoA", "NMDS", "MDS", "PCoA")
plist <- llply(as.list(ord_meths), function(i, physeq, dist){
ordi <- ordinate(physeq, method=i, distance = dist)
plot_ordination(physeq, ordi, "samples", color="SampleType")
}, GP1, dist)
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1333468
## Run 1 stress 0.1556502
## Run 2 stress 0.1333468
## ... New best solution
## ... Procrustes: rmse 3.31743e-06 max resid 1.001591e-05
## ... Similar to previous best
## Run 3 stress 0.1544493
## Run 4 stress 0.1471619
## Run 5 stress 0.1507771
## Run 6 stress 0.1453581
## Run 7 stress 0.1628707
## Run 8 stress 0.1628707
## Run 9 stress 0.1460271
## Run 10 stress 0.1385323
## Run 11 stress 0.1677824
## Run 12 stress 0.1510167
## Run 13 stress 0.1714274
## Run 14 stress 0.1779273
## Run 15 stress 0.1470663
## Run 16 stress 0.1507161
## Run 17 stress 0.1469145
## Run 18 stress 0.1590019
## Run 19 stress 0.1498724
## Run 20 stress 0.1610051
## *** Solution reached
names(plist) <- ord_meths
Now we will extract the data from each of the individual plots and put them back together in one big data.frame suitable for including all plots in one graphic.
# Extract data from each of the individual plots and put them all together in one big dataframe
pdataframe <- ldply(plist, function(x){
df <- x$data[, 1:2]
colnames(df) <- c("Axis_1", "Axis_2")
return(cbind(df, x$data))
})
names(pdataframe)[1] <- "method"
Now with all ordination results combined in one dataframe, called pdataframe
, we can make a standard faceted scatterplot with ggplot2.
# Make faceted scatterplot
p <- ggplot(pdataframe, aes(Axis_1, Axis_2, color = SampleType, shape=human, fill=SampleType))
p <- p + geom_point(size=4) +
geom_polygon()
p <- p + facet_wrap(~method, scales="free")
p <- p + scale_fill_brewer(type="qual", palette = "Set1")
p <- p + scale_colour_brewer(type="qual", palette = "Set1")
p
If you want t larger version of an individual plot, print from the original plist
from which the pdataframe
was mde. For instance plot the detrended correspondence analysis (DCA), which is the second element of the list.
# Print DCA from original plist
plist[[2]]
Make it look nicer.
#Fluff it up
p <- plist[[2]] + scale_colour_brewer(type="qual", palette = "Set1")
p <- p + scale_fill_brewer(type="qual", palette = "Set1")
p <- p + geom_point(size = 5) +
geom_polygon(aes(fill=SampleType))
p
You can also do PCoA on UniFrac distances
Use the ordinate
function to perform weighted UniFrac then perform PCoA on the distance matrix. Then pass the data and the ordination results to plot_ordination
to create the ggplot2
graphic withe default settings.
# get unifrac distances and pcoa them
ordu <- ordinate(GP1, "PCoA", "unifrac", weighted=TRUE)
# plot it
plot_ordination(GP1, ordu, color="SampleType", shape = "human")
Let’s pretty it up.
# Fluff up the graph
p <- plot_ordination(GP1, ordu, color="SampleType", shape="human")
p <- p + geom_point(size=7)
p <- p + scale_colour_brewer(type="qual", palette = "Set1")
p + ggtitle("MDS/PCoA on weighted UniFrac distance, Global Patterns dataset")
Although the command is to the plot_richness
function, the function in fact calculates to more than to richness, in fact to all descriptions of alpha diversity.
First some setup
# Some setup for graphics
theme_set(theme_bw())
pal <- "Set1"
scale_colour_discrete <- function(palname=pal, ...){
scale_colour_brewer(palette=palname, ...)
}
scale_fill_discrete <- function(palname=pal, ...){
scale_fill_brewer(palette=palname, ...)
}
It is probably not a bad idea to prune OTUs that are not present in any of the samples (somehow that happens) but don’t trim more than that! Many richness estimates are modeled on singletons and doubletons, and you need to leave them in the dataset to get meaningful estimates.
# Prune off species = 0 across all samples
GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
The plot_richness
function produces the default graphic. Let’s look at the GP
example dataset.
# Default plot
plot_richness(GP)
## Warning: Removed 130 rows containing missing values (geom_errorbar).
We can, of course, just select a couple of measures to plot.
# Plot only Chao1, Shannon
plot_richness(GP, measures = c("Chao1", "Shannon"))
## Warning: Removed 26 rows containing missing values (geom_errorbar).
We can also specify a sample variable on which to organize the samples along the x-axis. An experimentally meaningful categorical variable is usually a good choice.
# Plot only Chao1, Shannon
plot_richness(GP, x="SampleType", measures = c("Chao1", "Shannon"))
## Warning: Removed 26 rows containing missing values (geom_errorbar).
Suppose we wanted to organize around an external variable that is not in the dataset already. As an example, think of a variable indicating if the sample was human or not. Let’s define that variable first.
# Create indicator variable if sample is human or not.
sample_data(GP)$human <- get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
Now we tell plot_richness
to map the new human variable on the x-axis, and shaade the points in different color groups accoring to SampleType.
# Use indicator for human as x-axis, shade according to sample type
plot_richness(GP, x="human", color="SampleType", measures = c("Chao1", "Shannon"))
## Warning: Removed 26 rows containing missing values (geom_errorbar).
We can merge samples that are from the environment (SampleType) and make the points bigger with ggplot2
. First, merge samples.
# Merge samples
GPst <- merge_samples(GP, "SampleType")
# Repair variables damaged during merge (coerced to numeric)
sample_data(GPst)$SampleType <- factor(sample_names(GPst))
sample_data(GPst)$human <- as.logical(sample_data(GPst)$human)
Now to plot
p <- plot_richness(GPst, x="human", color="SampleType", measures = c("Chao1", "Shannon"))
p <- p + geom_point(size=5, alpha=0.7)
p
## Warning: Removed 9 rows containing missing values (geom_errorbar).
The phyloseq
package also has powerful graphics for displaying and manipulating phylogenetic trees with the plot_tree
function.
We want to plot trees, but we notice that the node labels for the GlobalPatterns dataset are, well, strange. Here they are:
# Node labels for phylogenetic tree of GlobalPatterns
head(phy_tree(GlobalPatterns)$node.label, 10)
## [1] "" "0.858.4" "1.000.154" "0.764.3" "0.995.2"
## [6] "1.000.2" "0.943.7" "0.971.6" "0.766" "0.611"
These might be bootstrap values, but what is with the two decimals?
Let’s just take the first 4 characters.
# Take first 4 characters of labels
phy_tree(GlobalPatterns)$node.label <- substr(phy_tree(GlobalPatterns)$node.label, 1, 4)
head(phy_tree(GlobalPatterns)$node.label)
## [1] "" "0.85" "1.00" "0.76" "0.99" "1.00"
Much better, these look like bootstrap values now.
The dataset has many OTUs, more than will fit nicely on a tree graphic.
# Number of taxa in dataset
ntaxa(GlobalPatterns)
## [1] 19216
Let’s arbitarily prune to the first 50 OTUs in the dataset and store.
# Prune to first 50 OTUs
physeq <- prune_taxa(taxa_names(GlobalPatterns)[1:50], GlobalPatterns)
If we just use the default plot_tree
setting, look at what we get with this pruned tree.
# Plot the tree
plot_tree(physeq)
Kind of ugly. We can do beter.
The black dots next to the tips are one for each sample in which that OTU was observed. Some have more dots than others. The node labels are also next to the tips, making everything pretty messy. How can we clean this up?
First a tree with labels only.
# Plot tree, no points
plot_tree(physeq, "treeonly")
What about without the labels?
# Plot with no points and no labels
plot_tree(physeq, "treeonly", nodeplotblank)
We can adjust the way the branches are rotated to improve the looks using the ladderize
parameter.
# Ladderize the tree above
plot_tree(physeq, "treeonly", nodeplotblank, ladderize="left")
plot_tree(physeq, "treeonly", nodeplotblank, ladderize = TRUE)
Now let’s add the OTU labels back to the tips?
# Add tip labels back to tree above
plot_tree(physeq, nodelabf=nodeplotblank, label.tips="taxa_names", ladderize="left")
Whoa, we also added back the points. Let’s take those away.
# Tree above with labels, without points
plot_tree(physeq, "anythingelse")
For the default argument to the method parameter, “sampledodge”, a point is added next to each OTU tip for every sample in which that OTU was observed. Now we can add aesthetics to these points.
Let’s map color to the type of sample collected (location)
# Map color to type of sample collected
plot_tree(physeq, nodelabf = nodeplotboot(), ladderize = "left", color = "SampleType")
Or we can color according to the taxonomic level of Class
# Color according to taxonomic Class
plot_tree(physeq, nodelabf = nodeplotboot(), ladderize = "left", color = "Class")
We can also map to a point shape if the variable has 6 or less categories, even when color is also mapped. Let’s add point shape to indicate taxonomic Class while color indicates SampleType.
# add shape to points
plot_tree(physeq, nodelabf = nodeplotboot(), ladderize = "left", color = "SampleType", shape = "Class")
Making a radial tree is surprisingly easy with ggplot2
, as long as you remember that our vertically oriented tree is a cartesian mapping of the data to a graphic. A radial tree is the same mapping, but with polar, instead of cartesian, coordinates. To wit
# Make radial phylogenetic tree for esophagus dataset
data (esophagus)
plot_tree(esophagus, color = "Sample", ladderize = "left") + coord_polar(theta = "y")
We can make a more elaborate tree with the GlobalPatterns dataset, but will need to do some preliminary loading and trimming.
# Make radial tree for GlobalPatterns
plot_tree(physeq, nodelabf = nodeplotboot(60, 60, 3), color = "SampleType", shape = "Class", ladderize = "left") + coord_polar(theta = "y")
The Esophagus dataset is really simple - just 3 samples, no sample data, and a modest amount of sequencing per sample.
Let’s look at the default tree
# Default tree for Esophagus dataset
plot_tree(esophagus, title="Default Tree.")
If you want an unadorned tree, the treeonly
method can be used.
# Simplest tree
plot_tree(esophagus, "treeonly", title="method = \"treeonly\"")
Shading tips where OTUs are observed:
# Shade tips where OTUs are observed
plot_tree(esophagus, color="samples")
Scale the tips according to abundance.
# Scale tips by abundance
plot_tree(esophagus, size="abundance")
Color and scale on the same tree!
# Shade tips where OTUs are observed
plot_tree(esophagus, color="samples", size = "abundance")
Oh my!
There is some overlap of tip points. Let’s see if we can adjust spacing to spread them out a bit.
# Shade tips where OTUs are observed
plot_tree(esophagus, color="samples", size = "abundance", base.spacing = 0.03)
Can we display the actual numeric value of abundances that occurred 3 or more times in a sample?
# Shade tips where OTUs are observed
plot_tree(esophagus, color="samples", size = "abundance", base.spacing = 0.03, min.abundance = 3)
Let’s subset the GlobalPatterns dataset to just the observed Archaea.
# Subset GlobalPatterns to just Archaea
gpa <- subset_taxa(GlobalPatterns, Kingdom == "Archaea")
ntaxa(gpa)
## [1] 208
The number of taxa is small enough for a decent tree plot.
Too many OTUs means a tree that is pointless to displa in its entirety in one graphic of a standard size and font. So the whole GlobalPatterns dataset is probably a bad idea.
# Number of taxa in GlobalPatterns dataset
ntaxa(GlobalPatterns)
## [1] 19216
Let’s look at the basic tree with just a minimal set of parameter choices.
# Almost minimal trees for GlobalPatterns Archaea subset
plot_tree(gpa, color = "SampleType")
plot_tree(gpa, color = "Phylum")
plot_tree(gpa, color = "SampleType", shape = "Phylum")
plot_tree(gpa, color = "Phylum", label.tips = "Genus")
These ~200 taxa still make for a crowded graphic. So let’s subset further to just the Crenarcheota.
# Trees for just Crenarchaeota
gpac <- subset_taxa(gpa, Phylum=="Crenarchaeota")
plot_tree(gpac, color="SampleType", shape="Genus")
plot_tree(gpac, color = "SampleType", label.tips = "Genus")
Add some abundance information, then spread it out:
# Add abundance inforamtion then spread it out
plot_tree(gpac, color="SampleType", shape="Genus", size="abundance", plot.margin=0.4)
plot_tree(gpac, nodelabf=nodeplotblank, color="SampleType", shape="Genus", size="abundance", base.spacing=0.04, plot.margin=0.4)
Let’s use some of these same techniquest with the Chlamydiae-only tree that we had created earlier.
# Create Chlamydiae-only tree and plot again
gp.ch <- subset_taxa(GlobalPatterns, Phylum=="Chlamydiae")
plot_tree(gp.ch, color="SampleType", shape="Family", label.tips="Genus", size="abundance", plot.margin=0.6)
Let’s start again with the Chlamydiae-only subset of GlobalPatterns, and look at the basic bar plot.
# Basic plots with Chlamydiae subset
plot_bar(gp.ch)
Add some fill color to represent the Genus to which the OTUs belong
# Add fill color for Genus to basic bar plot
plot_bar(gp.ch, fill = "Genus")
Keep same fill color, but group by SampleType.
# Add fill color for Genus to basic bar plot
plot_bar(gp.ch, fill = "Genus", x = "SampleType")
Organize further with facets, graphing the Families faceted by SampleType and filled by Genus.
# Add fill color for Genus to basic bar plot
plot_bar(gp.ch, "Family", fill = "Genus", facet_grid = ~SampleType)
Even fancier by adding a layer of jittered points.
# Add a layer of jittered points
p = plot_bar(gp.ch, "Family", fill="Genus", facet_grid=~SampleType)
p + geom_point(aes(x=Family, y=Abundance), color="black", position="jitter", size=3)
Let’s first trim to the 10 most abundant genera.
TopNOTUs <- names(sort(taxa_sums(enterotype), TRUE)[1:10])
ent10 <- prune_taxa(TopNOTUs, enterotype)
The settings in the following R-chunk were chosen after trial-and-error, looking to show off certain features of the data to the best advantage.
# Bar graph for Enterotype to look at sequencing technology v genus detection
plot_bar(ent10, "SeqTech", fill="Enterotype", facet_grid = ~Genus)
You could of course opt instead to separate enterotype designation and genus designation:
# Separate into finer grid
plot_bar(ent10, "Genus", fill="Genus", facet_grid = SeqTech~Enterotype)
We can remove those OTU separation lines as follows:
# Separate into finer grid
p <- plot_bar(ent10, "Genus", fill="Genus", facet_grid = SeqTech~Enterotype)
p + geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack")
You can also create heatmaps using ordination methods to organize rows and columns instead of hierarchical clustering. Such an organization typically makes a more interpretable display. Lucky for us we can do all of this within phyloseq
!
Traditionally heatmaps have been used to emphasize data that is above or below a threshold as “hot” or “cold” respectively. But when used with OTU abundance data, the need is to see the relative patterns of high abundance OTUs against a background of low abundance or absent groups. Thus the default color scheme is dark blue (low abundance) to very light blue for highest abundance, with black representing missing or zero abundance values. Thus you can change the color scheme by changing the low
, high
, and na.value
arguments.
et’s get the top 300 most abundant Bacteria taxa across all samples in the GlobalPatterns dataset and plot a heatmap. Quick and dirty.
# Get top 300 most abundant Bacteria taxa in GlobalPatterns
gpt <- subset_taxa(GlobalPatterns, Kingdom=="Bacteria")
gpt <- prune_taxa(names(sort(taxa_sums(gpt),TRUE)[1:300]), gpt)
# Plot the heatmap
plot_heatmap(gpt, sample.label="SampleType")
## Warning: Transformation introduced infinite values in discrete y-axis
This is kind of noisy. Let’s deal with something a bit more manageable, like the Crenarchaeota phylum of Archaea, which we created above as gpac. Now the default heatmap
# Default heatmap for Crenarchaeota
plot_heatmap(gpac)
## Warning: Transformation introduced infinite values in discrete y-axis
Re-label by a sample variable and taxonomic family.
# Re-label and SampleType and Family
p <- plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family")
p
## Warning: Transformation introduced infinite values in discrete y-axis
And if you wanted axis lables but not hte labels on individual features?
# Relabel
p$scales$scales[[1]]$name <- "My X-Axis"
p$scales$scales[[2]]$name <- "My Y-Axis"
print(p)
## Warning: Transformation introduced infinite values in discrete y-axis
How about a different color scheme?
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low="#000033", high="#CCFF66")
## Warning: Transformation introduced infinite values in discrete y-axis
Or a very dark blue to red scheme:
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low="#000033", high="#FF3300")
## Warning: Transformation introduced infinite values in discrete y-axis
Dark blue to light blue:
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low="#000033", high="#66CCFF")
## Warning: Transformation introduced infinite values in discrete y-axis
Here is a “dark on light” color scheme, with the white value for NA and 0 elements.
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low="#66CCFF", high="#000033", na.value = "white")
## Warning: Transformation introduced infinite values in discrete y-axis
A somewhat similar color scheme to that above, but now the “near-zero” color is close to a cream color and the colors themselves closer to grey. Better contrast overall, but more subtle in a way, not as exciting.
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low="#FFFFCC", high="#000033", na.value = "white")
## Warning: Transformation introduced infinite values in discrete y-axis
Now let’s try with different distances, different ordination methods
First the default heatmap with NMDS ordination on the jaccard distance.
plot_heatmap(gpac, "NMDS", "jaccard")
## Warning: Transformation introduced infinite values in discrete y-axis
Detrended correspondence analysis.
plot_heatmap(gpac, "DCA", "none", "SampleType", "Family")
## Warning: Transformation introduced infinite values in discrete y-axis
Unconstrained redundancy analysis (PCA by any other name)
plot_heatmap(gpac, "RDA", "none", "SampleType", "Family")
## Warning: Transformation introduced infinite values in discrete y-axis
PCoA/MDS on the (default) Bray-Curtis distance.
plot_heatmap(gpac, "PCoA", "bray", "SampleType", "Family")
## Warning: Transformation introduced infinite values in discrete y-axis
PCoA/MDS on unweighted UniFrac distance
plot_heatmap(gpac, "PCoA", "unifrac", "SampleType", "Family")
## Warning: Transformation introduced infinite values in discrete y-axis
Same as above but with weighted UniFrac.
plot_heatmap(gpac, "PCoA", "unifrac", "SampleType", "Family", weighted=TRUE)
## Warning: Transformation introduced infinite values in discrete y-axis
And finally, just the regular heatmap produced by base-R graphics with a hierarchical clustering. Just in case you wanted to compare with plot_heatmap
. Because I know you want to.
heatmap(otu_table(gpac))
There is a random element to network layout methods. So we want this to be reproducible, so we will set the random number generator seed explicitly:
# Set the seed
set.seed(711L)
We are going to work first with the Enterotype dataset. We will first want to remove the 9 samples for which no enterotype designation was assigned.
# Remove samples with no enterotype designation
enterotype <- subset_samples(enterotype, !is.na(Enterotype))
We are going to create a network of samples in this dataset, connecting two samples when a distance between them is below some threshold.
Let’s plot first with the default settings.
# Default microbiome network plot
plot_net(enterotype, maxdist = 0.4, point_label="Sample_ID")
What do you notice here, other than a big mess?
Well there are two subgraphs that make up the majority of samples.
Let’s use some color to see about aspects of the network.
# Exercise some options
plot_net(enterotype, maxdist = 0.3, color="SeqTech", shape="Enterotype")
What do you notice now?