Last updated: 2018-08-24
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180807)
The command set.seed(20180807)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: b1cf8d0
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/raw/
Ignored: src/.DS_Store
Ignored: src/.ipynb_checkpoints/
Ignored: src/Rmd/.Rhistory
Untracked files:
Untracked: Snakefile_clonality
Untracked: Snakefile_somatic_calling
Untracked: code/analysis_for_garx.Rmd
Untracked: code/yuanhua/
Untracked: data/canopy/
Untracked: data/cell_assignment/
Untracked: data/de_analysis_FTv62/
Untracked: data/donor_info_070818.txt
Untracked: data/donor_info_core.csv
Untracked: data/donor_neutrality.tsv
Untracked: data/fdr10.annot.txt.gz
Untracked: data/high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt
Untracked: data/high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz
Untracked: data/human_H_v5p2.rdata
Untracked: data/human_c2_v5p2.rdata
Untracked: data/human_c6_v5p2.rdata
Untracked: data/neg-bin-rsquared-petr.csv
Untracked: data/neutralitytestr-petr.tsv
Untracked: data/sce_merged_donors_cardelino_donorid_all_qc_filt.rds
Untracked: data/sce_merged_donors_cardelino_donorid_all_with_qc_labels.rds
Untracked: data/sce_merged_donors_cardelino_donorid_unstim_qc_filt.rds
Untracked: data/sces/
Untracked: data/simulations/
Untracked: data/variance_components/
Untracked: figures/
Untracked: metadata/
Untracked: output/differential_expression/
Untracked: output/donor_specific/
Untracked: output/line_info.tsv
Untracked: output/nvars_by_category_by_donor.tsv
Untracked: output/nvars_by_category_by_line.tsv
Untracked: output/variance_components/
Untracked: references/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 8f884ae | davismcc | 2018-08-24 | Adding data pre-processing and line overview html files |
Rmd | 5aa174b | davismcc | 2018-08-20 | Tidying up and saving dataframe to output file |
Rmd | fc582db | davismcc | 2018-08-20 | Fixing up small bug. |
Rmd | 3856e54 | davismcc | 2018-08-20 | Adding overview analysis of lines |
This document provides overview information for 32 healthy human fibroblast cell lines used in this project. Note that each cell line was each derived from a distinct donor, so we use the terms “line” and “donor” interchangeably.
library(readr)
library(dplyr)
library(scran)
library(scater)
library(viridis)
library(ggplot2)
library(ggforce)
library(ggridges)
library(SingleCellExperiment)
library(edgeR)
library(limma)
library(org.Hs.eg.db)
library(cowplot)
library(gplots)
library(ggrepel)
library(sigfit)
library(Rcpp)
library(deconstructSigs)
options(stringsAsFactors = FALSE)
Load donor level information.
donor_info <- as.data.frame(read_csv("data/donor_info_core.csv"))
# merge age bins
donor_info$age_decade <- ""
for (i in 1:nrow(donor_info)) {
if (donor_info$age[i] %in% c("30-34", "35-39"))
donor_info$age_decade[i] <- "30-39"
if (donor_info$age[i] %in% c("40-44", "45-49"))
donor_info$age_decade[i] <- "40-49"
if (donor_info$age[i] %in% c("50-54", "55-59"))
donor_info$age_decade[i] <- "50-59"
if (donor_info$age[i] %in% c("60-64", "65-69"))
donor_info$age_decade[i] <- "60-69"
if (donor_info$age[i] %in% c("70-74", "75-79"))
donor_info$age_decade[i] <- "70-79"
}
Load exome variant sites.
exome_sites <- read_tsv("data/high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz",
col_types = "ciccdcciiiiccccccccddcdcll", comment = "#",
col_names = TRUE)
exome_sites <- dplyr::mutate(
exome_sites,
chrom = paste0("chr", gsub("chr", "", chrom)),
var_id = paste0(chrom, ":", pos, "_", ref, "_", alt),
chr_pos = paste0(chrom, "_", pos))
exome_sites <- as.data.frame(exome_sites)
## deduplicate sites list
exome_sites <- exome_sites[!duplicated(exome_sites[["var_id"]]),]
## calculate coverage at sites for each donor
donor_vars_coverage <- list()
for (i in unique(exome_sites$donor_short_id)) {
exome_sites_subset <- exome_sites[exome_sites$donor_short_id == i, ]
donor_vars_coverage[[i]] <- exome_sites_subset$nREF_fibro + exome_sites_subset$nALT_fibro
}
Load VEP annotations and show table with number of variants assigned to each functional annotation category.
vep_best <- read_tsv("data/high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt")
colnames(vep_best)[1] <- "Uploaded_variation"
## deduplicate dataframe
vep_best <- as.data.frame(vep_best[!duplicated(vep_best[["Uploaded_variation"]]),])
as.data.frame(table(vep_best[["Consequence"]]))
Var1 Freq
1 3_prime_UTR_variant 181
2 5_prime_UTR_variant 121
3 downstream_gene_variant 13
4 intergenic_variant 13
5 intron_variant 1539
6 mature_miRNA_variant 3
7 missense_variant 3648
8 non_coding_transcript_exon_variant 428
9 regulatory_region_variant 1
10 splice_acceptor_variant 41
11 splice_donor_variant 24
12 splice_region_variant 291
13 start_lost 9
14 stop_gained 227
15 stop_lost 2
16 stop_retained_variant 5
17 synonymous_variant 1923
18 upstream_gene_variant 34
Add consequences to exome sites.
vep_best[["var_id"]] <- paste0("chr", vep_best[["Uploaded_variation"]])
exome_sites <- inner_join(exome_sites,
vep_best[, c("var_id", "Location", "Consequence")],
by = "var_id")
Add donor level mutation information (aggregate across impacts) Not used in manuscript, but still calculated to store in donor_info
table.
impactful_csq <- c("stop_lost", "start_lost", "stop_gained",
"splice_donor_variant", "splice_acceptor_variant",
"splice_region_variant", "missense_variant")
donor_info$num_mutations <- NA
donor_info$num_synonymous <- NA
donor_info$num_missense <- NA
donor_info$num_splice_region <- NA
donor_info$num_splice_acceptor <- NA
donor_info$num_splice_donor <- NA
donor_info$num_stop_gained <- NA
donor_info$num_start_lost <- NA
donor_info$num_stop_lost <- NA
for (i in unique(donor_info$donor_short)) {
if (i %in% unique(exome_sites$donor_short_id)) {
exome_sites_subset <- exome_sites[exome_sites$donor_short_id == i, ]
donor_info$num_mutations[donor_info$donor_short == i] <- length(exome_sites_subset$Consequence)
donor_info$num_synonymous[donor_info$donor_short == i] <- sum(exome_sites_subset$Consequence == "synonymous_variant")
donor_info$num_missense[donor_info$donor_short == i] <- sum(exome_sites_subset$Consequence == "missense_variant")
donor_info$num_splice_region[donor_info$donor_short == i] <- sum(exome_sites_subset$Consequence == "splice_region_variant")
donor_info$num_splice_acceptor[donor_info$donor_short == i] <- sum(exome_sites_subset$Consequence == "splice_acceptor_variant")
donor_info$num_splice_donor[donor_info$donor_short == i] <- sum(exome_sites_subset$Consequence == "splice_donor_variant")
donor_info$num_stop_gained[donor_info$donor_short == i] <- sum(exome_sites_subset$Consequence == "stop_gained")
donor_info$num_start_lost[donor_info$donor_short == i] <- sum(exome_sites_subset$Consequence == "start_lost")
donor_info$num_stop_lost[donor_info$donor_short == i] <- sum(exome_sites_subset$Consequence == "stop_lost")
}
}
Add donor level mutation selection info (Daniel Kunz). We produce a scatter plot of goodness of fit for each line for the cumulative mutations (Williams et al, 2016) and the negative binomial (Simons, 2016) models of neutral evolution. We label the example line joxm
, which is analysed in more depth in other scripts.
ntrtestrPetr <- read.table("data/neutralitytestr-petr.tsv",
stringsAsFactors = FALSE, header = TRUE)
negbinfitPetr = read.table("data/neg-bin-rsquared-petr.csv",
stringsAsFactors = FALSE, header = TRUE, sep = ",")
negbinfitPetr$sampleID <- negbinfitPetr$fname
negbinfitPetr$sampleID <- gsub("petr-AF-", "", negbinfitPetr$sampleID)
negbinfitPetr$sampleID <- gsub(".tsv", "", negbinfitPetr$sampleID)
rownames(negbinfitPetr) <- negbinfitPetr$sampleID
dfrsq <- data.frame(sampleID = ntrtestrPetr$sampleID,
rsq_ntrtestr = ntrtestrPetr$rsq,
rsq_negbinfit = negbinfitPetr[ntrtestrPetr$sampleID, "rsq"])
cutoff_selection_cummut <- 0.85
cutoff_selection_negbin <- 0.25
cutoff_neutral_cummut <- 0.9
cutoff_neutral_negbin <- 0.55
dfrsq$candidatelabel <- NA
dfrsq$candidatelabel[dfrsq$sampleID == "joxm"] <- "joxm"
filter_selection <- (dfrsq$rsq_ntrtestr < cutoff_selection_cummut) &
(dfrsq$rsq_negbinfit < cutoff_selection_negbin)
filter_neutral <- (dfrsq$rsq_ntrtestr > cutoff_neutral_cummut) &
(dfrsq$rsq_negbinfit > cutoff_neutral_negbin)
dfrsq$selection <- "undetermined"
dfrsq$selection[filter_selection] <- "selected"
dfrsq$selection[filter_neutral] <- "neutral"
colnames(dfrsq)[1] <- "donor_short"
donor_info <- merge(donor_info, dfrsq, by = "donor_short")
donor_info$selection_colour <- "#CCCCCC"
for (i in 1:nrow(donor_info)) {
if (donor_info$selection[i] == "neutral")
donor_info$selection_colour[i] <- "dodgerblue"
if (donor_info$selection[i] == "selected")
donor_info$selection_colour[i] <- "dodgerblue4"
}
donors <- c("euts", "fawm", "feec", "fikt", "garx", "gesg", "heja", "hipn",
"ieki", "joxm", "kuco", "laey", "lexy", "naju", "nusw", "oaaz",
"oilg", "pipw", "puie", "qayj", "qolg", "qonc", "rozh", "sehl",
"ualf", "vass", "vils", "vuna", "wahn", "wetu", "xugn", "zoxy")
dfrsq_filt <- dfrsq[(dfrsq$donor_short %in% donors),]
plt_scatter <- ggplot(dfrsq_filt, aes(x = rsq_negbinfit, y = rsq_ntrtestr)) +
scale_colour_manual(values = c("neutral" = "dodgerblue",
"selected" = "dodgerblue4",
"undetermined" = "#CCCCCC")) +
geom_point(aes(colour = selection)) +
geom_label_repel(aes(label = candidatelabel), color = "white", size = 2.5,
fill = "black", box.padding = 0.35, point.padding = 0.5,
segment.color = "grey50") +
theme_bw() +
theme(text = element_text(size = 9,face = "bold"),
axis.text = element_text(size = 9, face = "bold"),
axis.title = element_text(size = 10, face = "bold"),
plot.title = element_text(size = 9, hjust = 0.5)) +
labs(x = "Goodness of Fit - Negative Binomial Distribution",
y = "Goodness of Fit - Cumulative Mutations") +
theme(strip.background = element_blank()) +
labs(title = "") +
theme(legend.justification = c(1,0), legend.position = c(1,0)) +
theme(legend.background = element_rect(fill = "white", linetype = 1,
colour = "black", size =0.2),
legend.key.size = unit(0.25, "cm")) +
labs(colour = "Selection") #+
# coord_fixed()
ggsave("figures/overview_lines/neutral_selection_models_gof_scatter.png",
plot = plt_scatter, width = 12, height = 12, dpi = 300, units = "cm")
plt_scatter
Add donor level mutation signature exposures, using the sigfit
package and 30 COSMIC signatures. We load the filtered exome variant sites and calculate the tri-nucleotide context for each variant (required for computing signature exposures), using a function from the deconstructSigs
package.
data("cosmic_signatures", package = "sigfit")
##new input
mutation_list <- read.table("data/high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz", header = TRUE)
mutation_list$chr_pos <- paste0("chr", mutation_list$chr, "_", mutation_list$pos)
mutation_donors <- unique(mutation_list$donor_short_id)
mutation_list_donors <- list()
for (i in mutation_donors) {
cat("....reading ", i, "\n")
mutation_list_donors[[i]] <- mutation_list[which(mutation_list$donor_short_id == i),]
mutation_list_donors[[i]]$chr <- paste0("chr", mutation_list_donors[[i]]$chrom)
mutation_list_donors[[i]]$chr_pos = paste0(mutation_list_donors[[i]]$chr, "_", mutation_list_donors[[i]]$pos)
}
## Calculate triNucleotide contexts for mutations using deconstructSigs command
mut_triNs <- list()
for (i in mutation_donors) {
cat("....processing ", i, "\n")
mut_triNs[[i]] <- mut.to.sigs.input(mutation_list_donors[[i]], sample.id = "donor_short_id",
chr = "chr", pos = "pos", ref = "ref", alt = "alt")
}
## Convert to correct format
sig_triNs <- character()
for (j in 1:96) {
c1 <- substr(colnames(mut_triNs[[1]])[j], 1,1)
ref <- substr(colnames(mut_triNs[[1]])[j], 3,3)
alt <- substr(colnames(mut_triNs[[1]])[j], 5,5)
c3 <- substr(colnames(mut_triNs[[1]])[j], 7,7)
triN_sigfit <- paste0(c1,ref,c3,">",c1,alt,c3)
sig_triNs[j] <- triN_sigfit
}
for(i in mutation_donors) {
colnames(mut_triNs[[i]]) <- sig_triNs
}
## Fit signatures using sigfit
mcmc_samples_fit <- list()
set.seed(1234)
for (i in mutation_donors) {
mcmc_samples_fit[[i]] <- sigfit::fit_signatures(
counts = mut_triNs[[i]], signatures = cosmic_signatures,
iter = 2000, warmup = 1000, chains = 1, seed = 1)
}
## Estimate exposures using sigfit
exposures <- list()
for (i in mutation_donors) {
exposures[[i]] <- sigfit::retrieve_pars(
mcmc_samples_fit[[i]], par = "exposures", hpd_prob = 0.90)
}
Plot an exposure barchart for each line.
## Plot exposure bar charts
donors <- c("euts", "fawm", "feec", "fikt", "garx", "gesg", "heja", "hipn",
"ieki", "joxm", "kuco", "laey", "lexy", "naju", "nusw", "oaaz",
"oilg", "pipw", "puie", "qayj", "qolg", "qonc", "rozh", "sehl",
"ualf", "vass", "vils", "vuna", "wahn", "wetu", "xugn", "zoxy")
signature_names <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
"12", "13", "14", "15", "16", "17", "18", "19", "20", "21",
"22", "23", "24", "25", "26", "27", "28", "29", "30")
for (j in donors) {
cat("....plotting ", j, "\n")
sigfit::plot_exposures(mcmc_samples_fit[[j]],
signature_names = signature_names)
png(paste0("figures/overview_lines/mutational_signatures/exposure_barchart_",
j, ".png"),
units = "in", width = 12, height = 10, res = 500)
sigfit::plot_exposures(mcmc_samples_fit[[j]],
signature_names = signature_names)
dev.off()
}
....plotting euts
....plotting fawm
....plotting feec
....plotting fikt
....plotting garx
....plotting gesg
....plotting heja
....plotting hipn
....plotting ieki
....plotting joxm
....plotting kuco
....plotting laey
....plotting lexy
....plotting naju
....plotting nusw
....plotting oaaz
....plotting oilg
....plotting pipw
....plotting puie
....plotting qayj
....plotting qolg
....plotting qonc
....plotting rozh
....plotting sehl
....plotting ualf
....plotting vass
....plotting vils
....plotting vuna
....plotting wahn
....plotting wetu
....plotting xugn
....plotting zoxy
Retrieve exposures for a given signal. Specifically, we will look an highest posterior density (HPD) intervals and mean exposures for Signature 7 (UV) and Signature 11, the only two that are significant across multiple lines. We add this information to the donor_info
dataframe.
get_signature_df <- function(exposures, samples, signature) {
signature_mat_mean <- matrix(NA, nrow = length(samples), 30)
for (i in 1:length(samples)) {
signature_mat_mean[i,] <- as.numeric(exposures[[i]]$mean)
}
rownames(signature_mat_mean) <- samples
colnames(signature_mat_mean) <- colnames(exposures[[i]]$mean)
signature_mat_lower90 <- matrix(NA, nrow = length(samples), 30)
for (i in 1:length(samples)) {
signature_mat_lower90[i,] <- as.numeric(exposures[[i]]$lower_90)
}
rownames(signature_mat_lower90) <- samples
colnames(signature_mat_lower90) <- colnames(exposures[[i]]$lower_90)
signature_mat_upper90 <- matrix(NA, nrow = length(samples), 30)
for (i in 1:length(samples)) {
signature_mat_upper90[i,] <- as.numeric(exposures[[i]]$upper_90)
}
rownames(signature_mat_upper90) <- samples
colnames(signature_mat_upper90) <- colnames(exposures[[i]]$upper_90)
signature_df <- cbind(
as.data.frame(signature_mat_mean[,signature]),
as.data.frame(signature_mat_lower90[,signature]),
as.data.frame(signature_mat_upper90[,signature]))
colnames(signature_df) <- c(paste0("Sig",signature,"_mean"),
paste0("Sig",signature,"_lower"),
paste0("Sig",signature,"_upper"))
signature_df$donor <- rownames(signature_df)
signature_df
}
## Get lower, mean and upper values for signatures 7 & 11
sig7_df <- get_signature_df(exposures, mutation_donors, 7)
sig11_df <- get_signature_df(exposures, mutation_donors, 11)
## Add Sigs 7/11 to donor info table
sig_subset_df <- merge(sig7_df,sig11_df, by = "donor")
sig_subset_df_means <- sig_subset_df[,c(1,grep("mean", colnames(sig_subset_df)))]
sig_subset_df_means_melt <- reshape2::melt(sig_subset_df_means)
sig_subset_df_means_melt$signature <- substr(sig_subset_df_means_melt$variable, 1, 5)
sig_subset_df_means_melt <- sig_subset_df_means_melt[,c("donor", "signature", "value")]
colnames(sig_subset_df)[1] <- "donor_short"
donor_info <- merge(donor_info, sig_subset_df, by = "donor_short")
Read in annotated SingleCellExperiment (SCE) objects and create a list of SCE objects containing all cells used for analysis and their assignment (using cardelino
) to clones identified with Canopy from whole-exome sequencing data.
params <- list()
params$callset <- "filt_lenient.cell_coverage_sites"
fls <- list.files("data/sces")
fls <- fls[grepl(params$callset, fls)]
donors <- gsub(".*ce_([a-z]+)_.*", "\\1", fls)
sce_unst_list <- list()
for (don in donors) {
sce_unst_list[[don]] <- readRDS(file.path("data/sces",
paste0("sce_", don, "_with_clone_assignments.", params$callset, ".rds")))
cat(paste("reading", don, ": ", ncol(sce_unst_list[[don]]), "cells.\n"))
}
reading euts : 79 cells.
reading fawm : 53 cells.
reading feec : 75 cells.
reading fikt : 39 cells.
reading garx : 70 cells.
reading gesg : 105 cells.
reading heja : 50 cells.
reading hipn : 62 cells.
reading ieki : 58 cells.
reading joxm : 79 cells.
reading kuco : 48 cells.
reading laey : 55 cells.
reading lexy : 63 cells.
reading naju : 44 cells.
reading nusw : 60 cells.
reading oaaz : 38 cells.
reading oilg : 90 cells.
reading pipw : 107 cells.
reading puie : 41 cells.
reading qayj : 97 cells.
reading qolg : 36 cells.
reading qonc : 58 cells.
reading rozh : 91 cells.
reading sehl : 30 cells.
reading ualf : 89 cells.
reading vass : 37 cells.
reading vils : 37 cells.
reading vuna : 71 cells.
reading wahn : 82 cells.
reading wetu : 77 cells.
reading xugn : 35 cells.
reading zoxy : 88 cells.
## Calculate single cell data metrics
sc_metrics_summary <- list()
sc_metrics_summary_df <- data.frame()
for (don in donors) {
sc_metrics_summary[[don]]$num_unst_cells <- ncol(sce_unst_list[[don]])
sc_metrics_summary[[don]]$num_assignable <- sum(sce_unst_list[[don]]$assignable)
sc_metrics_summary[[don]]$num_unassignable <-
(sc_metrics_summary[[don]]$num_unst_cells -
sc_metrics_summary[[don]]$num_assignable)
sc_metrics_summary[[don]]$num_clones_with_cells <-
length(unique(sce_unst_list[[don]]$assigned[
which(sce_unst_list[[don]]$assigned != "unassigned")]))
sc_metrics_summary[[don]]$donor <- don
sc_metrics_summary_df <-
rbind(sc_metrics_summary_df, as.data.frame(sc_metrics_summary[[don]]))
}
colnames(sc_metrics_summary_df) <- c("total_unst_cells", "assigned_unst_cells",
"unassigned_unst_cells",
"num_clones_with_cells", "donor_short")
## Merge with donor info table
donor_info <- merge(donor_info, sc_metrics_summary_df, by = "donor_short",
all.x = TRUE)
donor_info$percent_assigned_cells <-
donor_info$assigned_unst_cells / donor_info$total_unst_cells
First, we read in the canopy output for each line we analyse.
canopy_files <- list.files("data/canopy")
canopy_files <- canopy_files[grepl(params$callset, canopy_files)]
canopy_list <- list()
for (don in donors) {
canopy_list[[don]] <- readRDS(
file.path("data/canopy",
paste0("canopy_results.", don, ".", params$callset, ".rds")))
}
Second, summarise the number of mutations for each clone, for each line. Form these results into a dataframe.
clone_mut_list <- list()
for (don in donors) {
cat(paste("summarising clones and mutations for", don, '\n'))
clone_mut_list[[don]] <- colSums(canopy_list[[don]]$tree$Z)
cat(colSums(canopy_list[[don]]$tree$Z))
cat('\n')
}
summarising clones and mutations for euts
0 39 29
summarising clones and mutations for fawm
0 17 16
summarising clones and mutations for feec
0 19 9 6
summarising clones and mutations for fikt
0 13 23
summarising clones and mutations for garx
0 79 57
summarising clones and mutations for gesg
0 37 23
summarising clones and mutations for heja
0 16 20
summarising clones and mutations for hipn
0 8 8
summarising clones and mutations for ieki
0 7 10
summarising clones and mutations for joxm
0 41 98
summarising clones and mutations for kuco
0 9
summarising clones and mutations for laey
0 45 49
summarising clones and mutations for lexy
0 9 6
summarising clones and mutations for naju
0 13
summarising clones and mutations for nusw
0 3 13
summarising clones and mutations for oaaz
0 17 19
summarising clones and mutations for oilg
0 2 37
summarising clones and mutations for pipw
0 34 36
summarising clones and mutations for puie
0 13 15
summarising clones and mutations for qayj
0 11 7
summarising clones and mutations for qolg
0 23
summarising clones and mutations for qonc
0 17 7
summarising clones and mutations for rozh
0 11 10 12
summarising clones and mutations for sehl
0 2 11 23
summarising clones and mutations for ualf
0 29 39
summarising clones and mutations for vass
0 98 35
summarising clones and mutations for vils
0 1 2 8
summarising clones and mutations for vuna
0 33
summarising clones and mutations for wahn
0 52 114
summarising clones and mutations for wetu
0 8 17
summarising clones and mutations for xugn
0 16 12
summarising clones and mutations for zoxy
0 14 8
clone_mut_df <- data.frame(clone1 = numeric(), clone2 = numeric(),
clone3 = numeric(), clone4 = numeric())
for (i in 1:length(donors)) {
num_clones <- length(clone_mut_list[[i]])
num_NAs <- 4 - num_clones
temp_row <- c(clone_mut_list[[i]], rep(NA, num_NAs))
clone_mut_df[i,] <- temp_row
}
rownames(clone_mut_df) <- donors
clone_mut_df$donor_short <- donors
Next, summarise the number of unique mutations tagging each clone identified by Canopy (that is, the number of variants for each clone that distinguish it from other clones in the line). Produce a dataframe with this information as well.
clone_mut_unique_list <- list()
for (don in donors) {
cat(paste("summarising clones and mutations for", don, '\n'))
clone_mut_unique_list[[don]] <-
colSums(canopy_list[[don]]$tree$Z[
(rowSums(canopy_list[[don]]$tree$Z) == 1),])
cat(colSums(canopy_list[[don]]$tree$Z[
(rowSums(canopy_list[[don]]$tree$Z) == 1),]))
cat('\n')
}
summarising clones and mutations for euts
0 20 10
summarising clones and mutations for fawm
0 3 2
summarising clones and mutations for feec
0 16 4 1
summarising clones and mutations for fikt
0 3 13
summarising clones and mutations for garx
0 41 19
summarising clones and mutations for gesg
0 20 6
summarising clones and mutations for heja
0 8 12
summarising clones and mutations for hipn
0 6 6
summarising clones and mutations for ieki
0 6 9
summarising clones and mutations for joxm
0 14 71
summarising clones and mutations for kuco
0 9
summarising clones and mutations for laey
0 16 20
summarising clones and mutations for lexy
0 5 2
summarising clones and mutations for naju
0 13
summarising clones and mutations for nusw
0 0 10
summarising clones and mutations for oaaz
0 8 10
summarising clones and mutations for oilg
0 2 37
summarising clones and mutations for pipw
0 17 19
summarising clones and mutations for puie
0 4 6
summarising clones and mutations for qayj
0 7 3
summarising clones and mutations for qolg
0 23
summarising clones and mutations for qonc
0 13 3
summarising clones and mutations for rozh
0 7 0 2
summarising clones and mutations for sehl
0 0 3 15
summarising clones and mutations for ualf
0 15 25
summarising clones and mutations for vass
0 65 2
summarising clones and mutations for vils
0 0 1 7
summarising clones and mutations for vuna
0 33
summarising clones and mutations for wahn
0 1 63
summarising clones and mutations for wetu
0 2 11
summarising clones and mutations for xugn
0 6 2
summarising clones and mutations for zoxy
0 7 1
clone_mut_unique_df <- data.frame(clone1 = numeric(), clone2 = numeric(),
clone3 = numeric(),
clone4 = numeric(), min_unique_muts = numeric())
for (i in 1:length(donors)) {
num_clones <- length(clone_mut_unique_list[[i]])
num_NAs <- 4 - num_clones
temp_row <- c(clone_mut_unique_list[[i]], rep(NA, num_NAs),
min(clone_mut_unique_list[[i]][2:num_clones]))
clone_mut_unique_df[i,] <- temp_row
}
rownames(clone_mut_df) <- donors
clone_mut_df$donor_short <- donors
rownames(clone_mut_unique_df) <- donors
clone_mut_unique_df$donor_short <- donors
donor_info <- merge(donor_info, clone_mut_unique_df, by = "donor_short", all.x = T)
Finally, calculate the minimum Hamming distance between pairs of clones for each line. In general, assignment of cells to clones will be easier/more successful for lines with larger numbers of variants distinguishes between clones (that is, a high minimum Hamming distance).
We add all of this information to the donor_info
dataframe and then have the data prepared to make some overview plots across lines.
## Calculate Hamming distance
clone_mut_list_hamming <- list()
for (don in donors) {
Config <- canopy_list[[don]]$tree$Z
unique_sites_paired <- c()
for (i in seq_len(ncol(Config) - 1)) {
for (j in seq(i + 1, ncol(Config))) {
n_sites <- sum(rowSums(Config[, c(i,j)]) == 1)
unique_sites_paired <- c(unique_sites_paired, n_sites)
}
}
clone_mut_list_hamming[[don]] <- unique_sites_paired
cat("....hamming distances for ", don, ": ", unique_sites_paired, "\n")
}
....hamming distances for euts : 39 29 30
....hamming distances for fawm : 17 16 5
....hamming distances for feec : 19 9 6 22 19 5
....hamming distances for fikt : 13 23 16
....hamming distances for garx : 79 57 60
....hamming distances for gesg : 37 23 26
....hamming distances for heja : 16 20 20
....hamming distances for hipn : 8 8 12
....hamming distances for ieki : 7 10 15
....hamming distances for joxm : 41 98 85
....hamming distances for kuco : 9
....hamming distances for laey : 45 49 36
....hamming distances for lexy : 9 6 7
....hamming distances for naju : 13
....hamming distances for nusw : 3 13 10
....hamming distances for oaaz : 17 19 18
....hamming distances for oilg : 2 37 39
....hamming distances for pipw : 34 36 36
....hamming distances for puie : 13 15 10
....hamming distances for qayj : 11 7 10
....hamming distances for qolg : 23
....hamming distances for qonc : 17 7 16
....hamming distances for rozh : 11 10 12 13 15 2
....hamming distances for sehl : 2 11 23 9 21 18
....hamming distances for ualf : 29 39 40
....hamming distances for vass : 98 35 67
....hamming distances for vils : 1 2 8 1 7 8
....hamming distances for vuna : 33
....hamming distances for wahn : 52 114 64
....hamming distances for wetu : 8 17 13
....hamming distances for xugn : 16 12 8
....hamming distances for zoxy : 14 8 8
min_hamming_distance <- data.frame("donor_short" = donors,
"min_hamming_dist" = 0)
for (i in 1:length(donors)) {
min_hamming_distance$min_hamming_dist[i] <- min(clone_mut_list_hamming[[i]])
}
donor_info <- merge(donor_info, min_hamming_distance, by = "donor_short",
all.x = TRUE)
## Number of clones
donor_info$num_clones_total <- 0
for (i in donors) {
num_clones <- length(clone_mut_unique_list[[i]])
donor_info$num_clones_total[which(donor_info$donor_short == i)] <- num_clones
}
We make a large combined plot stitching together individual plots showing:
for each line.
donor_info_filt <- donor_info[which(donor_info$donor_short %in% donors), ]
donor_filt_order <- dplyr::arrange(donor_info_filt, desc(num_mutations))
donor_filt_order <- donor_filt_order$donor_short
rownames(donor_info_filt) <- donor_info_filt$donor_short
# Plot age
selected_vars <- c("donor_short", "age_decade", "selection_colour")
donor_info_filt_melt <- reshape2::melt(donor_info_filt[,selected_vars],
"donor_short")
donor_info_filt_melt <- donor_info_filt[,selected_vars]
age_plot_filt <- ggplot(
donor_info_filt_melt,
aes(x = "Age", y = factor(donor_short, levels = rev(donor_filt_order)),
fill = age_decade)) +
geom_tile() +
labs(x = "", y = "Donor", fill = "") +
scale_fill_manual(values = rev(c(magma(8)[-c(1,8)]))) +
guides(fill = guide_legend(nrow = 3,byrow = TRUE)) +
theme(axis.text.x = element_text(size = 28, face = "bold"),
axis.line = element_blank(),
legend.position = "top", legend.direction = "horizontal",
legend.text = element_text(size = 24, face = "bold"),
legend.key.size = unit(0.3,"in"),
axis.ticks.x = element_blank(),
legend.margin = margin(unit(0.01, "cm")),
axis.text.y = element_text(size = 28, face = "bold"),
axis.title.y = element_text(size = 28, face = "bold"),
axis.title.x = element_text(size = 28, face = "bold"))
# Plot number of mutations
selected_vars <- c("donor_short", "num_mutations")
donor_info_filt_melt <- reshape2::melt(donor_info_filt[,selected_vars],
"donor_short")
num_mut_plot_filt <- ggplot(
donor_info_filt_melt,
aes(x = value, y = factor(donor_short, levels = rev(donor_filt_order)))) +
geom_point(size = 4, alpha = 0.7) +
scale_shape_manual(values = c(4, 19)) +
scale_x_log10(breaks=c(50, 100, 500)) +
theme_bw(16) +
ggtitle(" ") +
labs(x = "Number of mutations", y = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.line = element_blank(),
legend.position = "top",
axis.text.x = element_text(size = 28, face = "bold"),
title = element_text(colour="black",size = 26, face = "bold"),
axis.title.x = element_text(size = 28, face = "bold"))
# Plot signature 7
donor_info_filt_melt <- reshape2::melt(
donor_info_filt[,c(1,grep("Sig7_mean",colnames(donor_info_filt)))])
donor_info_filt_melt$signature <- substr(donor_info_filt_melt$variable, 1, 5)
donor_info_filt_melt <-
donor_info_filt_melt[, c("donor_short", "signature", "value")]
sig_decomp_plot_filt <- ggplot(
donor_info_filt_melt,
aes(y = value, x = factor(donor_short, levels = rev(donor_filt_order)),
fill = factor(signature))) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = rev(c(magma(10)[-c(1,10)]))) +
theme_bw(16) +
ggtitle(" ") +
labs(x = "",y = "Signature 7 exposure", fill = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.line = element_blank(),
axis.text.x = element_text(size = 28, face = "bold"),
title = element_text(colour = "black", size = 26, face = "bold"),
axis.title.x = element_text(size = 28, face = "bold")) +
guides(fill = FALSE)
# Plot total number of mutations per clone
donor_info_filt_subset <-
donor_info_filt[,c(1,grep("clone",colnames(donor_info_filt)))]
# Remove columns that do not relate to number of mutations per clone:
donor_info_filt_subset <- donor_info_filt_subset[,c(-2,-7)]
donor_info_filt_melt <- reshape2::melt(donor_info_filt_subset, "donor_short")
donor_info_filt_melt$variable <- substr(donor_info_filt_melt$variable, 6, 6)
num_clones_plot_filt <- ggplot(
donor_info_filt_melt,
aes(variable, factor(donor_short, levels = rev(donor_filt_order)))) +
geom_tile(aes(fill = value), colour = "white") +
scale_fill_distiller(palette = "BuPu", values = c(0,0.05,0.1,0.15,0.25,0.5,1),
na.value = "white", breaks = c(0,25,50,75),
direction = 1) +
labs(x = "Clone", y = "", fill = "Number of mutations per clone") +
theme(axis.ticks.y = element_blank(),
axis.line = element_blank(),
legend.position = "top",
legend.key.size = unit(0.5,"in"),
axis.title.x = element_text(size = 28, face = "bold"),
axis.text.x = element_text(size = 28, face = "bold"),
axis.text.y = element_blank(),
legend.justification = "center",
legend.text = element_text(size = 26, face = "bold"),
legend.title = element_text(size = 26, face = "bold"))
## Combine above plots into Fig 2a
fig_2a <- cowplot::plot_grid(age_plot_filt, num_mut_plot_filt,
sig_decomp_plot_filt, num_clones_plot_filt,
nrow = 1, rel_widths = c(3, 4, 4, 6), align = "h",
axis = "t", scale = c(1, 0.988, 0.988, 0.988))
ggsave("figures/overview_lines/overview_lines_BuPu.png",
plot = fig_2a, width = 30, height = 14)
fig_2a
Finally, we plot the cell assignment rate (from cardelino
) against the minimum Hamming distance (minimum number of variants distinguishing a pair of clones) for each line.
fig_2c_no_size <- ggplot(
donor_info_filt,
aes(y = as.numeric(percent_assigned_cells), x = as.numeric(min_hamming_dist),
fill = total_unst_cells)) +
geom_point(pch = 21, colour = "gray50", size = 4) +
scale_shape_manual(values = c(4, 19)) +
ylim(c(0, 1)) +
scale_x_log10() +
scale_fill_viridis(option = "magma") +
ylab("Proportion cells assigned") +
xlab("Minimum number of variants distinguishing clones") +
labs(title = "") +
labs(fill = "Number of cells") +
theme_bw() +
theme(text = element_text(size = 9,face = "bold"),
axis.text = element_text(size = 9, face = "bold"),
axis.title = element_text(size = 10, face = "bold"),
plot.title = element_text(size = 9, hjust = 0.5)) +
theme(strip.background = element_blank()) +
theme(legend.justification = c(1,0),
legend.position = c(1,0),
legend.direction = "horizontal",
legend.background = element_rect(fill = "white", linetype = 1,
colour = "black", size = 0.2),
legend.key.size = unit(0.25, "cm"))
ggsave("figures/overview_lines/cell_assignment_vs_min_hamming_dist.png",
plot = fig_2c_no_size, width = 12, height = 12, dpi = 300, units = "cm")
fig_2c_no_size
Save the donor info dataframe to output/line_info.tsv
.
write_tsv(donor_info_filt, path = "output/line_info.tsv", col_names = TRUE)
devtools::session_info()
setting value
version R version 3.5.1 (2018-07-02)
system x86_64, darwin15.6.0
ui X11
language (EN)
collate en_GB.UTF-8
tz Europe/London
date 2018-08-24
package * version date
AnnotationDbi * 1.42.1 2018-05-08
assertthat 0.2.0 2017-04-11
backports 1.1.2 2017-12-13
base * 3.5.1 2018-07-05
beeswarm 0.2.3 2016-04-25
bindr 0.1.1 2018-03-13
bindrcpp * 0.2.2 2018-03-29
Biobase * 2.40.0 2018-05-01
BiocGenerics * 0.26.0 2018-05-01
BiocParallel * 1.14.2 2018-07-08
Biostrings 2.48.0 2018-05-01
bit 1.1-14 2018-05-29
bit64 0.9-7 2017-05-08
bitops 1.0-6 2013-08-17
blob 1.1.1 2018-03-25
BSgenome 1.48.0 2018-05-01
BSgenome.Hsapiens.UCSC.hg19 1.4.0 2018-08-20
caTools 1.17.1.1 2018-07-20
clue 0.3-55 2018-04-23
cluster 2.0.7-1 2018-04-13
coda 0.19-1 2016-12-08
codetools 0.2-15 2016-10-05
colorspace 1.3-2 2016-12-14
compiler 3.5.1 2018-07-05
cowplot * 0.9.3 2018-07-15
crayon 1.3.4 2017-09-16
data.table 1.11.4 2018-05-27
datasets * 3.5.1 2018-07-05
DBI 1.0.0 2018-05-02
deconstructSigs * 1.8.0 2016-07-29
DelayedArray * 0.6.5 2018-08-15
DelayedMatrixStats 1.2.0 2018-05-01
devtools 1.13.6 2018-06-27
digest 0.6.15 2018-01-28
dplyr * 0.7.6 2018-06-29
DT 0.4 2018-01-30
dynamicTreeCut 1.63-1 2016-03-11
edgeR * 3.22.3 2018-06-21
evaluate 0.11 2018-07-17
FNN 1.1.2.1 2018-08-10
gdata 2.18.0 2017-06-06
GenomeInfoDb * 1.16.0 2018-05-01
GenomeInfoDbData 1.1.0 2018-04-25
GenomicAlignments 1.16.0 2018-05-01
GenomicRanges * 1.32.6 2018-07-20
ggbeeswarm 0.6.0 2017-08-07
ggforce * 0.1.3 2018-07-07
ggplot2 * 3.0.0 2018-07-03
ggrepel * 0.8.0 2018-05-09
ggridges * 0.5.0 2018-04-05
git2r 0.23.0 2018-07-17
glue 1.3.0 2018-07-17
gplots * 3.0.1 2016-03-30
graphics * 3.5.1 2018-07-05
grDevices * 3.5.1 2018-07-05
grid 3.5.1 2018-07-05
gridExtra 2.3 2017-09-09
gtable 0.2.0 2016-02-26
gtools 3.8.1 2018-06-26
hms 0.4.2 2018-03-10
htmltools 0.3.6 2017-04-28
htmlwidgets 1.2 2018-04-19
httpuv 1.4.5 2018-07-19
igraph 1.2.2 2018-07-27
inline 0.3.15 2018-05-18
IRanges * 2.14.10 2018-05-16
KernSmooth 2.23-15 2015-06-29
knitr 1.20 2018-02-20
labeling 0.3 2014-08-23
later 0.7.3 2018-06-08
lattice 0.20-35 2017-03-25
lazyeval 0.2.1 2017-10-29
limma * 3.36.2 2018-06-21
locfit 1.5-9.1 2013-04-20
magrittr 1.5 2014-11-22
MASS 7.3-50 2018-04-30
Matrix 1.2-14 2018-04-13
matrixStats * 0.54.0 2018-07-23
memoise 1.1.0 2017-04-21
methods * 3.5.1 2018-07-05
mime 0.5 2016-07-07
munsell 0.5.0 2018-06-12
org.Hs.eg.db * 3.6.0 2018-05-15
parallel * 3.5.1 2018-07-05
pillar 1.3.0 2018-07-14
pkgconfig 2.0.2 2018-08-16
plyr 1.8.4 2016-06-08
promises 1.0.1 2018-04-13
purrr 0.2.5 2018-05-29
R.methodsS3 1.7.1 2016-02-16
R.oo 1.22.0 2018-04-22
R.utils 2.6.0 2017-11-05
R6 2.2.2 2017-06-17
RColorBrewer 1.1-2 2014-12-07
Rcpp * 0.12.18 2018-07-23
RCurl 1.95-4.11 2018-07-15
readr * 1.1.1 2017-05-16
reshape2 1.4.3 2017-12-11
rhdf5 2.24.0 2018-05-01
Rhdf5lib 1.2.1 2018-05-17
rjson 0.2.20 2018-06-08
rlang 0.2.2 2018-08-16
rmarkdown 1.10 2018-06-11
rprojroot 1.3-2 2018-01-03
Rsamtools 1.32.2 2018-07-03
RSQLite 2.1.1 2018-05-06
rstan 2.17.3 2018-01-20
rtracklayer 1.40.4 2018-08-10
S4Vectors * 0.18.3 2018-06-08
scales 1.0.0 2018-08-09
scater * 1.9.12 2018-08-03
scran * 1.8.4 2018-08-07
shiny 1.1.0 2018-05-17
sigfit * 1.1.0 2018-05-08
SingleCellExperiment * 1.2.0 2018-05-01
StanHeaders 2.17.2 2018-01-20
statmod 1.4.30 2017-06-18
stats * 3.5.1 2018-07-05
stats4 * 3.5.1 2018-07-05
stringi 1.2.4 2018-07-20
stringr 1.3.1 2018-05-10
SummarizedExperiment * 1.10.1 2018-05-11
tibble 1.4.2 2018-01-22
tidyselect 0.2.4 2018-02-26
tools 3.5.1 2018-07-05
tweenr 0.1.5 2016-10-10
tximport 1.8.0 2018-05-01
units 0.6-0 2018-06-09
utils * 3.5.1 2018-07-05
vipor 0.4.5 2017-03-22
viridis * 0.5.1 2018-03-29
viridisLite * 0.3.0 2018-02-01
whisker 0.3-2 2013-04-28
withr 2.1.2 2018-03-15
workflowr 1.1.1 2018-07-06
XML 3.98-1.16 2018-08-19
xtable 1.8-2 2016-02-05
XVector 0.20.0 2018-05-01
yaml 2.2.0 2018-07-25
zlibbioc 1.26.0 2018-05-01
source
Bioconductor
CRAN (R 3.5.0)
CRAN (R 3.5.0)
local
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
Bioconductor
Bioconductor
Bioconductor
Bioconductor
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
Bioconductor
Bioconductor
CRAN (R 3.5.0)
cran (@0.3-55)
CRAN (R 3.5.0)
cran (@0.19-1)
CRAN (R 3.5.1)
CRAN (R 3.5.0)
local
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
local
CRAN (R 3.5.0)
CRAN (R 3.5.1)
Bioconductor
Bioconductor
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.1)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
Bioconductor
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
Bioconductor
Bioconductor
Bioconductor
Bioconductor
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
local
local
local
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
Bioconductor
CRAN (R 3.5.1)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.1)
CRAN (R 3.5.0)
Bioconductor
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.1)
CRAN (R 3.5.1)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
local
CRAN (R 3.5.0)
CRAN (R 3.5.0)
Bioconductor
local
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
Bioconductor
Bioconductor
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
Bioconductor
CRAN (R 3.5.0)
cran (@2.17.3)
Bioconductor
Bioconductor
CRAN (R 3.5.0)
Bioconductor
Bioconductor
CRAN (R 3.5.0)
Github (kgori/sigfit@55cad41)
Bioconductor
cran (@2.17.2)
CRAN (R 3.5.0)
local
local
CRAN (R 3.5.0)
CRAN (R 3.5.0)
Bioconductor
CRAN (R 3.5.0)
CRAN (R 3.5.0)
local
CRAN (R 3.5.0)
Bioconductor
CRAN (R 3.5.0)
local
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.0)
CRAN (R 3.5.1)
CRAN (R 3.5.0)
Bioconductor
CRAN (R 3.5.1)
Bioconductor
This reproducible R Markdown analysis was created with workflowr 1.1.1