Last updated: 2018-11-11
workflowr checks:

✔ R Markdown file: up-to-date. Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

✔ Repository version: 4db2338. All relevant files for the analysis should be committed to Git before the results are generated (e.g. with wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/.DS_Store
Untracked files:
Untracked: analysis/chipexoeg.Rmd
Untracked: analysis/talk1011.Rmd
Untracked: data/chipexo_examples/
Untracked: data/chipseq_examples/
Untracked: docs/figure/vstiter.Rmd/
Untracked: talk.Rmd
Untracked: talk.pdf
Unstaged changes:
Modified: analysis/literature.Rmd
Modified: analysis/sigma.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
| File | Version | Author | Date | Message |
| --- | --- | --- | --- | --- |
Rmd | 4db2338 | Dongyue Xie | 2018-11-11 | wflow_publish(c(“analysis/index.Rmd”, “analysis/vstiter.Rmd”, “analysis/fda.Rmd”)) |
html | e55f4a7 | Dongyue Xie | 2018-10-18 | Build site. |
Rmd | 83d5406 | Dongyue Xie | 2018-10-18 | wflow_publish(“analysis/index.Rmd”) |
html | 9916bf6 | Dongyue Xie | 2018-10-07 | Build site. |
Rmd | 86d1b4b | Dongyue Xie | 2018-10-07 | add |
html | 3ce9535 | Dongyue Xie | 2018-10-05 | Build site. |
html | ba619d6 | Dongyue Xie | 2018-10-02 | Build site. |
Rmd | d7c4a01 | Dongyue Xie | 2018-10-02 | revise |
html | eaee67b | Dongyue Xie | 2018-10-02 | Build site. |
Rmd | eddcacf | Dongyue Xie | 2018-10-02 | revise |
html | b3c65ef | Dongyue | 2018-06-03 | chip seq data analysis |
Rmd | 7095e13 | Dongyue | 2018-06-03 | chip seq data analysis |
html | 54238d8 | Dongyue | 2018-06-03 | missing data |
Rmd | 3aa5ca5 | Dongyue | 2018-06-03 | missing data |
html | 549140f | Dongyue | 2018-05-30 | covariate iterative |
Rmd | 38e8063 | Dongyue | 2018-05-30 | covariate iterative |
html | 4b0e6c4 | Dongyue | 2018-05-25 | edit |
Rmd | d810c3e | Dongyue | 2018-05-25 | edit |
html | 568cf1a | Dongyue | 2018-05-24 | edit |
Rmd | 35cc4d6 | Dongyue | 2018-05-24 | edit |
Rmd | ec87323 | Dongyue | 2018-05-24 | edit |
html | 511adb2 | Dongyue | 2018-05-19 | wave basis |
Rmd | 0f77e70 | Dongyue | 2018-05-19 | wave basis |
html | 4b2e5d1 | Dongyue | 2018-05-17 | add known version |
Rmd | bdddcf6 | Dongyue | 2018-05-14 | edit |
Rmd | 7ee9791 | Dongyue | 2018-05-14 | edit |
html | 5f9b5c6 | Dongyue | 2018-05-09 | edit |
Rmd | fda2411 | Dongyue | 2018-05-09 | one iteration ash poisson |
html | 3a42238 | Dongyue | 2018-05-08 | correction |
Rmd | 910bc07 | Dongyue | 2018-05-08 | correction |
html | cb91cb1 | Dongyue | 2018-05-08 | add robust |
Rmd | 2e73919 | Dongyue | 2018-05-08 | add robust |
html | 44d3413 | Dongyue | 2018-05-07 | correct mu_t+E(u_t) |
Rmd | b082c1b | Dongyue | 2018-05-07 | correct mu_t+E(u_t) |
Rmd | ee44935 | Dongyue | 2018-05-06 | unknown sigma version |
html | 767077f | Dongyue | 2018-05-06 | unknown sigma version |
html | 341e471 | Dongyue | 2018-05-05 | edit |
html | c72247b | Dongyue | 2018-05-05 | edit |
Rmd | 3bd6a61 | Dongyue | 2018-05-05 | first commit |
html | ca322e8 | Dongyue | 2018-05-05 | first commit |
Rmd | b7e89a3 | DongyueXie | 2018-05-01 | Start workflowr project. |
We generalize smash (Xing and Stephens, 2016), a flexible empirical Bayes method for signal denoising, to handle non-Gaussian data, typically Poisson and binomial sequence data.
This R package contains functions used in this project.
This is a new review and summary I wrote in Sept 2018.
This section contains analyses from the very early stage of the project, when we assumed the nugget effect was known.
We initially tried using the sample mean as the initial parameter estimate to form \(y_t\), then running an iterative algorithm. The problem is that the algorithm sometimes fails to converge, especially when there are outliers (see 2). One way to address this is to set the finest-level wavelet coefficients to zero so that the outliers are largely removed (at least no longer extreme). This does seem to solve the problem, but it may be a kind of 'cheating'.
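For the Haar basis, zeroing the finest-level detail coefficients amounts to replacing each consecutive pair of observations by its average, which caps the influence of any single outlier. A minimal base-R sketch of this idea (an illustration only, not the project's actual implementation, which uses wavelet shrinkage via smash):

```r
# Zero the finest-level Haar detail coefficients of y (even length).
# One Haar step maps each pair (y[2i-1], y[2i]) to an average and a detail;
# setting the detail to zero and inverting replaces both entries of the
# pair by their mean, so a single outlier's effect is halved.
zero_finest_haar <- function(y) {
  stopifnot(length(y) %% 2 == 0)
  odd  <- y[seq(1, length(y), by = 2)]
  even <- y[seq(2, length(y), by = 2)]
  rep((odd + even) / 2, each = 2)     # inverse transform with details = 0
}

y <- c(1, 1.2, 0.9, 25, 1.1, 1.0, 0.8, 1.2)   # 25 is an outlier
zero_finest_haar(y)                   # outlier pair becomes (12.95, 12.95)
```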
Another option is to drop the iterations and use a single step, but this does not work well because \(\sum_t x_t/T\) is not a good approximation to the parameters. Finally, we decided to expand around the posterior mean from ash (see 5 below). The default choice is a grid of uniform priors on the parameter (\(\lambda\) in Poisson(\(\lambda\)) and \(p\) in Binomial(\(n,p\))).
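The expansion around a point estimate can be written out explicitly for the Poisson case: a first-order Taylor expansion of \(\log x_t\) around \(\hat\lambda_t\) gives pseudo-data \(y_t=\log\hat\lambda_t+(x_t-\hat\lambda_t)/\hat\lambda_t\) with approximate variance \(1/\hat\lambda_t\). A base-R sketch (in the project \(\hat\lambda_t\) is the ash posterior mean and the pseudo-data are smoothed with smash.gaus; the crude shrinkage default below is a hypothetical stand-in):

```r
# Gaussian pseudo-data for Poisson counts: expand log(x_t) around a point
# estimate lambda_hat. In the project lambda_hat comes from ash; the
# default here is a crude, hypothetical shrink-toward-the-mean stand-in.
poisson_pseudo_data <- function(x, lambda_hat = NULL) {
  if (is.null(lambda_hat)) {
    lambda_hat <- 0.9 * pmax(x, 0.5) + 0.1 * mean(x)   # avoid zeros
  }
  list(
    y  = log(lambda_hat) + (x - lambda_hat) / lambda_hat,  # pseudo-data
    s2 = 1 / lambda_hat                                    # approx. variance
  )
}

pd <- poisson_pseudo_data(rpois(8, lambda = 10))
# pd$y with sd sqrt(pd$s2) would then be passed to a Gaussian smoother
# such as smash.gaus
```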
In practice the nugget effect is unknown. To apply our smashgen method, we need to know \(\sigma^2+s_t^2\). My initial idea was that we could either estimate \(\sigma^2\) alone or estimate \(\sigma^2+s_t^2\) jointly. The method proposed for estimating the nugget effect works well, as shown in 1 below. Estimating \(\sigma^2\) first and then adding \(s_t^2\) gives a better variance estimate than estimating \(\sigma^2+s_t^2\) jointly (\(*\)). However, despite \((*)\), smash.gaus gives very similar estimates of \(\mu_t\) in terms of MSE. This is because smash.gaus is sensitive to the scale of the variance rather than its shape. As long as \(n\) is large enough that smash can estimate the scale of the variance well and its shape roughly, smash.gaus can still give similar mean estimates (see 2 below).
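For intuition, a standard first-difference estimator shows how a roughly constant total noise variance \(\sigma^2+s_t^2\) can be estimated when the mean is smooth. This textbook estimator is only a stand-in, not necessarily the method used in the project:

```r
# Difference-based estimate of the total noise variance sigma^2 + s_t^2,
# assumed roughly constant. If y_t = mu_t + e_t with mu smooth and the e_t
# independent with variance v, then E[(y_{t+1} - y_t)^2] is about 2v, so
# mean(diff(y)^2) / 2 estimates v.
diff_var_est <- function(y) mean(diff(y)^2) / 2

set.seed(1)
tt <- seq(0, 1, length.out = 512)
mu <- sin(2 * pi * tt)                # smooth mean
y  <- mu + rnorm(512, sd = 0.3)       # total noise sd = 0.3
sqrt(diff_var_est(y))                 # should be close to 0.3
```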
Now we are ready to apply smashgen to Poisson data with an unknown nugget effect. It is important to note that when the range of the mean function is small, the signal-to-noise ratio \(SNR=\frac{sd(\mu)}{mean[sd(noise)]}\) can be very low. For example, if the mean function is the spike function with range \([1,3]\), the standard deviation of the (log-scale) mean function is around 0.48, while the average of \(s_t\), where \(s_t^2=1/e^{\mu_t}\), is around 0.92. So even without a nugget effect, the SNR is already only about 0.52, and adding the nugget effect makes it smaller. The SNR matters here because smashgen relies on a Gaussian approximation, and a very low SNR makes it difficult to recover the true mean. This is why smashgen performs poorly when the range of the mean function is small.
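The SNR calculation above can be reproduced for any candidate mean function. A sketch using a hypothetical spike-shaped \(\lambda_t\) with range roughly \([1,3]\) (the exact spike function used in the simulations is not reproduced here):

```r
# SNR on the log scale for a Poisson mean function: the pseudo-data noise
# variance is s_t^2 = 1 / e^{mu_t} = 1 / lambda_t, so
# SNR = sd(log lambda) / mean(1 / sqrt(lambda)).
tt     <- seq(0, 1, length.out = 256)
lambda <- 1 + 2 * exp(-500 * (tt - 0.5)^2)   # hypothetical spike, range [1, 3]
mu     <- log(lambda)
snr    <- sd(mu) / mean(1 / sqrt(lambda))
snr  # well below 1: the noise sd is near 1 while the signal sd is much smaller
```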
From 3, when the scale of the mean function is small (smallest values around 0.1-0.3), smash outperforms smashgen for all the mean functions. From 4, when we increase the scale of the mean function (smallest values around 50, largest around 200), smashgen outperforms smash for almost all the functions except Doppler: smashgen cannot capture the high-frequency fluctuations at the beginning of Doppler. For the Heavisine, Parabola, Wave and time-shifted mean functions, the symmlet8 basis outperforms the Haar basis.
In 5, we compared smash and smashgen on estimating \(\mu\) and \(\log\mu\). For the step function, smashgen gives a better estimate of \(\log\mu\) but a worse estimate of \(\mu\). For Heavisine, smash and smashgen are similar, though smashgen with symmlet8 achieved lower MSE when the range of the mean function is (50,100). For Doppler, smashgen with symmlet8 achieved lower MSE when the range is (50,100). For Parabola, the two methods are similar. For Wave, smashgen with symmlet8 won. It is interesting that although smashgen gave a smaller MSE for estimating \(\log\mu\), its MSE for estimating \(\mu\) was larger, for example for the step and bump functions.
We extend smashgen to smooth probabilities. Simulations in 1 show that smashgen gives reasonable fits. When \(n\) is large and \(p\) is small in Binomial(\(n,p\)), the Poisson distribution approximates the binomial. In 2, we compared smashgen-binomial and smashgen-poi.binomial, as well as other popular smoothing methods (ti.thresh and eb.thresh). In general, smashgen-binomial outperforms the other methods. Although we expected smashgen-poi.binomial to give smaller MSE as \(n_t\) increases, this was not the case.
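The Poisson approximation underlying smashgen-poi.binomial is easy to check numerically: for large \(n\) and small \(p\), Binomial(\(n,p\)) is close to Poisson(\(np\)):

```r
# For large n and small p, Binomial(n, p) is well approximated by
# Poisson(n * p); compare the two pmfs directly.
n <- 1000; p <- 0.005                        # n * p = 5
k <- 0:20
max(abs(dbinom(k, n, p) - dpois(k, n * p)))  # tiny discrepancy
```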
Now suppose that at each \(t\), \(Y_t=X_t\beta+\mu_t+\epsilon_t\), where \(\mu\) has a smooth structure and \(\epsilon_t\sim N(0,\sigma^2_t)\). The structure of \(\mu\) cannot be explained by ordinary least squares, so it remains in the residuals \(e\); thus \(e\) consists of \(\mu\) plus noise. Applying smash.gaus to \(e\) recovers \(\mu\) and estimates \(\sigma^2\).
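The two-step idea can be sketched in a few lines: fit OLS, then smooth the residuals to recover \(\mu\). Here smooth.spline stands in for smash.gaus purely so the sketch runs with base R, and the data-generating choices are hypothetical:

```r
# Two-step fit for Y_t = X_t beta + mu_t + eps_t: OLS leaves the smooth
# part mu_t in the residuals, which are then smoothed.
# smooth.spline is a stand-in for smash.gaus; the data are simulated.
set.seed(2)
n  <- 256
tt <- seq(0, 1, length.out = n)
X  <- cbind(1, rnorm(n))              # intercept plus one covariate
beta <- c(2, 1.5)
mu <- sin(2 * pi * tt)                # smooth structure
y  <- drop(X %*% beta) + mu + rnorm(n, sd = 0.2)

fit    <- lm(y ~ X - 1)               # OLS: mu ends up in the residuals
mu_hat <- smooth.spline(tt, resid(fit))$y   # recovered smooth component
```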
We treat unevenly spaced data as missing: the missing values are set to 0 with corresponding standard error \(10^6\). The idea is that if the standard error is very large, the value of \(y\) becomes irrelevant. It does not work.
Some real-data applications of smashgen. One problem is that the sequence length is not a power of 2, so we have to either augment or cut the sequences. To augment, I reflected the two tails of the data; to cut, I retained the parts that are clearly nonzero. In conclusion, smashgen gives smoother fits in general. Two problems remain: is it over-smoothing, and how do we handle the peaks?
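One simple way to augment a sequence to a power-of-2 length is reflection padding. The sketch below reflects only the right tail (the project reflected both tails), so it is an illustration rather than the exact procedure used:

```r
# Pad a sequence to the next power-of-2 length by reflecting its right
# tail, so that wavelet methods can be applied.
reflect_pad <- function(y) {
  n <- length(y)
  m <- 2^ceiling(log2(n))
  if (m == n) return(y)
  c(y, rev(y)[seq_len(m - n)])        # reflected right tail
}

length(reflect_pad(rnorm(100)))       # 128, the next power of 2
```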
The primary role of CTCF is thought to be regulating the 3D structure of chromatin. CTCF binds together strands of DNA, thus forming chromatin loops, and anchors DNA to cellular structures like the nuclear lamina. It also defines the boundaries between active and heterochromatic DNA.
I am focusing on reading about: 1. additive models (gam, gamm, spam, gspam); 2. functional data analysis (wavelet-based functional mixed models, etc.); 3. more on exponential-family signal denoising (variance-stabilizing transforms, trend filtering).
This project is based on the ideas from Professor Matthew Stephens. Thanks to Matthew Stephens and Kushal K Dey for their great help.
This reproducible R Markdown analysis was created with workflowr 1.1.1