Machine learning for Genome-Wide Association Studies (GWAS) is still in its infancy and no software suite we know of currently exists. MachineLearningGWAS
is a package for R meant to meet this need. It is freely available at . This package depends on the following R packages: caret
for machine learning, ggplot2
for plotting and data.table
for data manipulation. It also depends on various Bioconductor packages. Briefly, it provides an interface to:
Import VCF file and clinical data into an R S4 object named MDT
and based on the data.table
package;
Annotate variants using Bioconductor annotations packages;
Estimate different statistics for variants such as p-values for Hardy-Weinberg equilibrium tests or association tests as well as Minor Allele Frequency (MAF).
Filter out variants or samples;
Use this data to train classifiers using caret
package, and organize results (classifiers, predictions, errors, feature importance) in one R S4 object named MLGWAS
and based on the data.table
package;
Plot and analyze genomics and classification data using the ggplot2
package.
To install the MachineLearningGWAS
, install R and RStudio, and then run the following code in the R console:
# install.packages("devtools")
devtools::install_github("olivmrtn/MachineLearningGWAS")
To load package to namespace, run the following code:
library(MachineLearningGWAS)
citation("MachineLearningGWAS")
To cite MachineLearningGWAS in publications use:
Olivier M. F. Martin (2016). MachineLearningGWAS: Machine
Learning for Genome-Wide Association Studies URL
https://www.github.com/olivmrtn/MachineLearningGWAS/
A BibTeX entry for LaTeX users is
@Manual{,
title = {{MachineLearningGWAS}: Machine Learning for Genome-Wide Association Studies},
author = {Olivier M. F. Martin},
year = {2016},
date = {1996},
url = {https://www.github.com/olivmrtn/MachineLearningGWAS/},
}
Throughout this tutorial, we will use simulated data. Genomics data is the VCF file snps.vcf
file and clinical data is tab-delimited file clinical.txt
. These can be retrieved using the following commands.
genotype_file <- system.file("extdata", "snps.vcf",
package = "MachineLearningGWAS")
clinical_file <- system.file("extdata", "clinical.txt",
package = "MachineLearningGWAS")
Briefly, the genomics data is composed of 200 samples and 1000 SNPs, out of which only 5 of them are truly associated with the response. These are indicated in the AS
field with a 1
. Genotypes are in the GT
field. The data is also comprised of dbSNP rs# identifier, and quality QUAL
and sequencing depth DP
.
Clinical data specifies the disease status of the patient (RESPONSE
), as well as age (AGE
), sex (SEX
) and patient identifier (SampleID
).
Currently, it is only possible to import genomics data as a list
of VCF
objects of the VariantAnnotation
package. This can be done using the readVcf
function. Alternatively, if your data is divided into more than one VCF files, you can use the readVcfDir
function in this package.
vcf <- VariantAnnotation::readVcf(genotype_file)
mdt <- vcfsToMDT(list(vcf))
Clinical data can be imported using the importPhenotype
function. The response_type
argument specifies if the RESPONSE
should be treated as a categorical or continuous variable and sep
specifies the file separator.
mdt <- importPhenotype(mdt, clinical_file, response_type = "factor", sep = "\t")
Printing the objects shows its contents.
mdt
--- MDT object ---
mtable(x): 200 individuals and 1000 features
annotations(x): 1000 rows in 5 columns
info(x): 200000 rows in 7 columns
response(x): factor 0: 101 | 1: 99
phenotype(x): 200 rows in 4 columns
An MDT
object is composed of four main data.table
: mtable
, annotations
, info
and phenotype
. These are interlinked by two identifiers (i.e. keys): SampleID
and FeatureID
.
SampleID
s and FeatureID
s can be retrieve using samples
and features
.
head(MachineLearningGWAS::samples(mdt))
[1] "Sample1" "Sample10" "Sample100" "Sample101" "Sample102" "Sample103"
head(MachineLearningGWAS::features(mdt))
[1] "1" "10" "100" "1000" "101" "102"
The mtable
data.table
contains genotypes coded as integers 0, 1 and 2 with the following meanings: 0: homozygote for the reference allele; 1: heterozygote; 2: homozygote for the alternative allele. Superior values correspond to other types of genotypes.
head(mtable(mdt))
FeatureID SampleID VALUE
1: 1 Sample1 1
2: 1 Sample10 0
3: 1 Sample100 0
4: 1 Sample101 0
5: 1 Sample102 0
6: 1 Sample103 1
The annotations
data.table
contains annotations about features. This can be expanded using the annotate* functions as we will in the next section.
head(annotations(mdt))
FeatureID CHROM POS STRAND REF ALT
1: 1 19 295180 * C T
2: 10 19 1615315 * C T
3: 100 19 18059681 * A G
4: 1000 22 51181851 * C T
5: 101 19 18094546 * A G
6: 102 19 18222131 * C T
The info
data.table
contains annotations that are dependent on features and samples such as quality measures (QUAL
) or genotypes (GT
).
head(MachineLearningGWAS::info(mdt))
FeatureID SampleID paramRangeID QUAL FILTER INFO_NS INFO_DP
1: 1 Sample1 NA 46.01215 . 200 274
2: 1 Sample10 NA 46.01215 . 200 274
3: 1 Sample100 NA 46.01215 . 200 274
4: 1 Sample101 NA 46.01215 . 200 274
5: 1 Sample102 NA 46.01215 . 200 274
6: 1 Sample103 NA 46.01215 . 200 274
INFO_AS GT
1: 0 0/1
2: 0 0/0
3: 0 0/0
4: 0 0/0
5: 0 0/0
6: 0 0/1
The phenotype
data.table
contains clinical data about samples as imported by importPhenotype
.
head(phenotype(mdt))
SampleID RESPONSE AGE SEX
1: Sample1 1 69 1
2: Sample10 0 66 1
3: Sample100 1 69 0
4: Sample101 1 61 0
5: Sample102 0 62 0
6: Sample103 1 80 1
All annotations functions can be found by typing help(annotateMDT)
in the R console. These functions append columns to the annotations
data.table
.
For example, we can append an univariate independence test and Hardy-Weinberg (HW) equilibrium p-values, as well as Minor Allele Frequency (MAF).
## Annotate Fisher's test and HW equilibrium p-values
mdt <- annotateStat(mdt, function_list = list(FISHER = function(x, y) {
stats::fisher.test(table(factor(x, c(0L, 1L, 2L)),
factor(y, c(0L, 1L))))$p.value },
HW = function(x, y) {
HardyWeinberg::HWExact(table(factor(x, c(0L, 1L, 2L))),
verbose = FALSE)$pval
}),
fill = NA, verbose = TRUE)
# Annotate MAF
mdt <- annotateMAF(mdt)
Entrez identifiers can be added using the annotateLocation
function. This function is dependant on a TxDb
object from Bioconductor. Here, we will be using the TxDb.Hsapiens.UCSC.hg19.knownGene
package.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
mdt <- annotateLocation(mdt, TxDb.Hsapiens.UCSC.hg19.knownGene)
These Entrez identifiers can be used to query different Bioconductor annotation databases such as org.Hs.eg.db
using annotateBioconductor
. Here under we append gene names and associated Gene Ontology (GO) terms.
library(org.Hs.eg.db)
mdt <- annotateBioconductor(mdt,
annotations_db = org.Hs.eg.db,
keys_colname = "ENTREZID",
columns = c("GENENAME", "GO"))
Finally, annotateCoding
allows annotating consequence of mutations and annotateRSID
to annotate dbSNP rs identifiers.
library(BSgenome.Hsapiens.UCSC.hg19)
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
mdt <- annotateCoding(mdt,
BSgenome.Hsapiens.UCSC.hg19,
TxDb.Hsapiens.UCSC.hg19.knownGene)
mdt <- annotateRSID(mdt, SNPlocs.Hsapiens.dbSNP144.GRCh37)
Let’s see results.
head(annotations(mdt))
FeatureID ENTREZID CHROM POS STRAND REF ALT FISHER HW
1: 1 NA 19 295180 * C T 0.2541678 0.05783275
2: 10 6929 19 1615315 * C T 0.6448435 0.30891569
3: 10 6929 19 1615315 * C T 0.6448435 0.30891569
4: 10 6929 19 1615315 * C T 0.6448435 0.30891569
5: 10 6929 19 1615315 * C T 0.6448435 0.30891569
6: 10 6929 19 1615315 * C T 0.6448435 0.30891569
MAF LOCATION GENENAME GO EVIDENCE ONTOLOGY
1: 0.0900 intergenic NA NA NA NA
2: 0.2975 intron transcription factor 3 GO:0000122 IDA BP
3: 0.2975 intron transcription factor 3 GO:0000790 IDA CC
4: 0.2975 intron transcription factor 3 GO:0000978 IDA MF
5: 0.2975 intron transcription factor 3 GO:0001078 IDA MF
6: 0.2975 intron transcription factor 3 GO:0002326 NAS BP
CONSEQUENCE_ALT CONSEQUENCE RSID
1: NA NA rs547079193
2: T synonymous rs147690653
3: T synonymous rs147690653
4: T synonymous rs147690653
5: T synonymous rs147690653
6: T synonymous rs147690653
We will present five functions allowing us to manipulate MDT
objects: selectMDT
, aggregateMDT,
[,
filterMDT, and
fillMDT`.
aggregateMDT
aggregate FeatureID
s into Higher Level IDs, for example, Entrez identifiers or GO terms. It requires a data.frame
specifying how to aggregate FeatureID
s. The first column is the old FeatureID
, the second is the new FeatureID
. These can be easily retrieved using selectMDT
.
df <- selectMDT(mdt, MachineLearningGWAS::features(mdt), "ENTREZID")
head(df)
FeatureID ENTREZID
1: 10 6929
2: 1000 49
3: 101 3780
4: 102 23031
5: 103 23031
6: 104 23031
This data.frame
can be used to aggregate the MDT
objects. As you can see, the FeatureID
corresponds to Entrez identifiers.
ag_mdt <- aggregateMDT(mdt, df, fun.aggregate = mean)
ag_mdt
--- MDT object ---
mtable(x): 200 individuals and 481 features
annotations(x): 481 rows in 0 columns
info(x): 96200 rows in 0 columns
response(x): factor 0: 101 | 1: 99
phenotype(x): 200 rows in 4 columns
head(mtable(ag_mdt))
FeatureID SampleID VALUE
1: 100128496 Sample1 1
2: 100128496 Sample10 1
3: 100128496 Sample100 0
4: 100128496 Sample101 0
5: 100128496 Sample102 1
6: 100128496 Sample103 0
The [
and filterMDT
are subsetting functions. [
subsets according to SampleID
s and FeatureID
s whereas filterMDT
according to some condition and data in one of the data.table
s.
For example, here we keep only FeatureID
s 1 and 2, and SampleID
s Sample1 and Sample2 These MUST be character
s. Don’t forget about samples
and features
.
mdt[c("Sample1", "Sample2"), c("1", "2")]
--- MDT object ---
mtable(x): 2 individuals and 2 features
annotations(x): 5 rows in 17 columns
info(x): 4 rows in 7 columns
response(x): factor 0: 1 | 1: 1
phenotype(x): 2 rows in 4 columns
On another hand, we can use filterMDT
to keep only non-synonymous variants and remove those with low quality.
mdt <- filterMDT(mdt, QUAL > 50, "info")
mdt
--- MDT object ---
mtable(x): 200 individuals and 945 features
annotations(x): 8671 rows in 17 columns
info(x): 189000 rows in 7 columns
response(x): factor 0: 101 | 1: 99
phenotype(x): 200 rows in 4 columns
Finally, if there are any missing entries, fillMDT
can be used. For example, we can assume that missing values were homozygotes for the REF allele.
fillMDT(mdt, fill = 0L)
--- MDT object ---
mtable(x): 200 individuals and 945 features
annotations(x): 8671 rows in 17 columns
info(x): 189000 rows in 7 columns
response(x): factor 0: 101 | 1: 99
phenotype(x): 200 rows in 4 columns
Plotting for MDT
is possible using plotMDT
. Available plots are scatterplots and heatmaps. Preprocessing such as scaling, centering, and PCA and be applied beforehand.
Here under is a scatter plot of the first two principal components in sample space.
plotMDT(mdt,
type = "scatter",
preprocess = c("scale", "center", "pca"),
space = c("samples"),
dims = c(1L, 2L))
Here under is a heatmap plot of associated SNPs. As you can see, there is a good separation between cases (blue) and controls (red).
try(plotMDT(filterMDT(mdt, INFO_AS == 1, "info"),
type = "heatmap",
preprocess = c("scale", "center")))
Finally, it is possible to create Manhattan plots using any numeric value in annotations
. Here, we will illustrate this using results from Fisher’s exact test we have computed using annotateStat
manhattanPlot(mdt, "FISHER")
MDT
objects can be used to train classifiers; results are contained in MLGWAS
objects. This is possible using trainClassifier
.
This function can be thought of as a wrapper for the train
function in the caret
package. By default, the function is composed of two main loops corresponding to double fold cross-validation. The outer fold is specified by the partitions
argument, a list
of indices which define the training set. The training set is used for preprocessing (a function
specified by the preprocess
argument) and as input for train
. The test set is all other indices; it is neither used preprocessing, training and will be used to estimate performance measures. It is, however, possible to set the test set as the whole dataset by setting validation
to FALSE
. The inner fold is handled by train
and can be set using its trControl
and tuneGrid
argument. Actually, any argument can be passed on to train
.
Here under we train two classifiers using random forest (method = 'rf'
). To make this run faster, we will only retain the associated SNPs and ten others. We will also use variables contained in phenotype
(except RESPONSE
).
allfeat <- MachineLearningGWAS::features(mdt)
associated <- unique(MachineLearningGWAS::info(mdt)[INFO_AS == 1, FeatureID])
random <- sample(allfeat[! allfeat %in% associated], 15)
selected <- c(associated, random)
mdt <- mdt[, selected]
clas <- trainClassifier(mdt, id = "clas", phen_vars = "all", method = 'rf')
An MLGWAS
object is composed of three main data.table
: performances
, featimp
, errors
. These are interlinked by fours identifiers (i.e. keys). SampleID
and FeatureID
are extracted from the MDT
objects. ClassifierID
is determined by the id
argument when calling trainClassifier
. Finally, PartitionID
is determined by the names of the partition
argument. Printing the MLGWAS
object reveals its identifiers. It should be noted that each of these data.table
as an associated summary
and plot
function.
clas
--- MLGWAS object ---
1 classifiers: clas
10 partitions: Fold01.Rep1, Fold02.Rep1, Fold03.Rep1, Fold04.Rep1, Fold05.Rep1, ...
22 features: 105, 114, 135, 18, 30, ...
200 samples: Sample1, Sample10, Sample100, Sample101, Sample102, ...
performances
contains estimates about classifier performance across different folds (partitions). It can be summarized using summaryPerf
performances(clas)
ClassifierID PartitionID Accuracy ROCAUC PPV NPV
1: clas Fold01.Rep1 0.8947368 0.8888889 0.8333333 1.0000000
2: clas Fold02.Rep1 0.8500000 0.8500000 0.8181818 0.8888889
3: clas Fold03.Rep1 1.0000000 1.0000000 1.0000000 1.0000000
4: clas Fold04.Rep1 0.9523810 0.9545455 1.0000000 0.9090909
5: clas Fold05.Rep1 1.0000000 1.0000000 1.0000000 1.0000000
6: clas Fold06.Rep1 0.9500000 0.9500000 0.9090909 1.0000000
7: clas Fold07.Rep1 0.9500000 0.9500000 1.0000000 0.9090909
8: clas Fold08.Rep1 0.9500000 0.9500000 0.9090909 1.0000000
9: clas Fold09.Rep1 0.9000000 0.9000000 0.9000000 0.9000000
10: clas Fold10.Rep1 0.9000000 0.9000000 1.0000000 0.8333333
Sensitivity Specificity F1Score
1: 1.0000000 0.7777778 0.9090909
2: 0.9000000 0.8000000 0.8571429
3: 1.0000000 1.0000000 1.0000000
4: 0.9090909 1.0000000 0.9523810
5: 1.0000000 1.0000000 1.0000000
6: 1.0000000 0.9000000 0.9523810
7: 0.9000000 1.0000000 0.9473684
8: 1.0000000 0.9000000 0.9523810
9: 0.9000000 0.9000000 0.9000000
10: 0.8000000 1.0000000 0.8888889
summaryPerf(clas)
ClassifierID Summary Accuracy ROCAUC PPV NPV
1: clas mean 0.93471178 0.93434343 0.93696970 0.94404040
2: clas sd 0.04798424 0.04865291 0.07275252 0.06266101
3: clas len 10.00000000 10.00000000 10.00000000 10.00000000
Sensitivity Specificity F1Score
1: 0.94090909 0.92777778 0.9359634
2: 0.06939989 0.08642416 0.0466479
3: 10.00000000 10.00000000 10.0000000
featimp
contains feature importance across folds. These values are obtained calling varImp
for caret
. It can be summarized using summaryFeatimp
head(featimp(clas))
ClassifierID PartitionID FeatureID Importance
1: clas Fold01.Rep1 105 6.581981806
2: clas Fold01.Rep1 114 0.008179545
3: clas Fold01.Rep1 135 1.949714639
4: clas Fold01.Rep1 18 9.055058882
5: clas Fold01.Rep1 30 4.795876359
6: clas Fold01.Rep1 350 7.684334815
head(summaryFeatimp(clas))
ClassifierID FeatureID mean sd len
1: clas 105 5.3855108 0.7992505 10
2: clas 114 0.0679622 0.1305055 10
3: clas 135 1.6748299 0.7568015 10
4: clas 18 7.9673276 0.7320299 10
5: clas 30 5.5747477 0.4263509 10
6: clas 350 7.6375052 0.4464784 10
Finally, errors
represent local residuals. It can help identify samples that are too predict. For categorical responses, 0
represents a success and 1
an error. It can be summarized using summaryErrors
head(na.omit(errors(clas)))
ClassifierID SampleID PartitionID Error
1: clas Sample1 Fold02.Rep1 0
2: clas Sample10 Fold06.Rep1 0
3: clas Sample100 Fold03.Rep1 0
4: clas Sample101 Fold01.Rep1 0
5: clas Sample102 Fold01.Rep1 0
6: clas Sample103 Fold10.Rep1 0
head(summaryErrors(clas))
ClassifierID SampleID mean sd len
1: clas Sample1 0 NA 1
2: clas Sample10 0 NA 1
3: clas Sample100 0 NA 1
4: clas Sample101 0 NA 1
5: clas Sample102 0 NA 1
6: clas Sample103 0 NA 1
One may with to compare performance results with a random setting. This can be done using permutations. We then bind the two objects together using bindML
.
perm <- trainClassifier(mdt, id = "perm",
method = 'rf', phen_vars = "all",
permute = TRUE)
fits <- bindML(clas, perm)
There exists different three main functions to plot MLGWAS
objects: plotPerf
, plotFeatimp
and plotErrors
.
plotPerf
plots performance measures. Here under we plot accuracy values. As expected, the classifier using the unpermuted data (clas
) outperforms random predictions (perm
).
plotPerf(fits, "Accuracy")
plotFeatimp
plots feature importance. Associated features are “555”, “615”, “666”, “770”, “917” and “AGE”. It so happens, these are the features being selected by our classifier. The permuted classifier, on the contrary, seams to be selecting features at random.
plotFeatimp(fits)
Finally, it possible to look at residual errors (i.e. per sample error) using plotErrors
.
It is also possible to create different types of models by playing around with arguments. One should also not that any argument in train
can be passed on. For example, we can do 10 bootstraps instead of double fold cross validation with the following code.
ml <- trainClassifier(mdt, id = "boot",
partition = caret::createResample(response(mdt),
times = 10),
trControl = trainControl("none"),
tuneGrid = data.frame(.mtry = 5L),
validation = FALSE)
It also possible to create a preprocessing function that only selects the 5 most associated values as determined by Fisher’s exact test as follows. This can then be used as input to the preprocess
argument.
fisherSelector <- function(x, y) {
pv <- apply(x, 2, function(col) {
fisher.test(table(factor(col, 0:2), y))$p.value
})
pv <- sort(pv)
selected <- names(pv)[1:5]
function(x) x[, selected]
}
Finally, the trainClassifiers
function allow to call trainClassifier
iteratively using a data.frame
of parameters. For example, instead of calling two times trainClassifer
to get a real and a permuted classifier, one can run this code. Note: functions must be defined within a vector
.
params <- expand.grid(method = "rf",
permuted = c(TRUE, FALSE),
preprocess = c(fisherSelector))
params$id <- c("perm", "real")
fits2 <- trainClassifiers(mdt, params)
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] caret_6.0-73
[2] ggplot2_2.2.0
[3] lattice_0.20-34
[4] randomForest_4.6-12
[5] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.11
[6] BSgenome.Hsapiens.UCSC.hg19_1.4.0
[7] BSgenome_1.42.0
[8] rtracklayer_1.34.1
[9] org.Hs.eg.db_3.4.0
[10] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[11] GenomicFeatures_1.26.0
[12] AnnotationDbi_1.36.0
[13] VariantAnnotation_1.20.1
[14] Rsamtools_1.26.1
[15] Biostrings_2.42.0
[16] XVector_0.14.0
[17] SummarizedExperiment_1.4.0
[18] Biobase_2.34.0
[19] GenomicRanges_1.26.1
[20] GenomeInfoDb_1.10.1
[21] IRanges_2.8.1
[22] S4Vectors_0.12.0
[23] BiocGenerics_0.20.0
[24] MachineLearningGWAS_0.1.0
[25] data.table_1.9.6
loaded via a namespace (and not attached):
[1] nlme_3.1-128 bitops_1.0-6
[3] pbkrtest_0.4-6 devtools_1.12.0
[5] rprojroot_1.1 tools_3.3.1
[7] backports_1.0.4 KernSmooth_2.23-15
[9] rpart_4.1-10 DBI_0.5-1
[11] lazyeval_0.2.0 mgcv_1.8-16
[13] colorspace_1.3-1 nnet_7.3-12
[15] withr_1.0.2 gridExtra_2.2.1
[17] compiler_3.3.1 chron_2.3-47
[19] quantreg_5.29 HardyWeinberg_1.5.6
[21] mice_2.25 SparseM_1.74
[23] labeling_0.3 caTools_1.17.1
[25] scales_0.4.1 pbapply_1.3-1
[27] stringr_1.1.0 digest_0.6.10
[29] minqa_1.2.4 rmarkdown_1.2
[31] DOSE_3.0.9 htmltools_0.3.5
[33] lme4_1.1-12 RSQLite_1.0.0
[35] BiocParallel_1.8.1 gtools_3.5.0
[37] GOSemSim_2.0.1 ModelMetrics_1.1.0
[39] car_2.1-3 RCurl_1.95-4.8
[41] magrittr_1.5 GO.db_3.4.0
[43] Matrix_1.2-7.1 Rcpp_0.12.8
[45] munsell_0.4.3 pROC_1.8
[47] stringi_1.1.2 yaml_2.1.14
[49] MASS_7.3-45 zlibbioc_1.20.0
[51] gplots_3.0.1 plyr_1.8.4
[53] qvalue_2.6.0 grid_3.3.1
[55] gdata_2.17.0 DO.db_2.9
[57] splines_3.3.1 knitr_1.15.1
[59] fgsea_1.0.2 igraph_1.0.1
[61] reshape2_1.4.2 codetools_0.2-14
[63] biomaRt_2.30.0 fastmatch_1.0-4
[65] XML_3.98-1.5 evaluate_0.10
[67] nloptr_1.0.4 foreach_1.4.3
[69] MatrixModels_0.4-1 gtable_0.2.0
[71] assertthat_0.1 e1071_1.6-7
[73] class_7.3-14 survival_2.40-1
[75] tibble_1.2 iterators_1.0.8
[77] GenomicAlignments_1.10.0 memoise_1.0.0