Last updated: 2018-11-11
workflowr checks:

✔ R Markdown file: up-to-date. Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

✔ Repository version: 4db2338. All relevant files for the analysis should be committed to Git before the results are generated (e.g. with wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/.DS_Store
Untracked files:
Untracked: analysis/chipexoeg.Rmd
Untracked: analysis/talk1011.Rmd
Untracked: data/chipexo_examples/
Untracked: data/chipseq_examples/
Untracked: docs/figure/vstiter.Rmd/
Untracked: talk.Rmd
Untracked: talk.pdf
Unstaged changes:
Modified: analysis/literature.Rmd
Modified: analysis/sigma.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
| File | Version | Author | Date | Message |
| --- | --- | --- | --- | --- |
Rmd | 4db2338 | Dongyue Xie | 2018-11-11 | wflow_publish(c(“analysis/index.Rmd”, “analysis/vstiter.Rmd”, “analysis/fda.Rmd”)) |
html | e55f4a7 | Dongyue Xie | 2018-10-18 | Build site. |
Rmd | 83d5406 | Dongyue Xie | 2018-10-18 | wflow_publish(“analysis/index.Rmd”) |
html | 9916bf6 | Dongyue Xie | 2018-10-07 | Build site. |
Rmd | 86d1b4b | Dongyue Xie | 2018-10-07 | add |
html | 3ce9535 | Dongyue Xie | 2018-10-05 | Build site. |
html | ba619d6 | Dongyue Xie | 2018-10-02 | Build site. |
Rmd | d7c4a01 | Dongyue Xie | 2018-10-02 | revise |
html | eaee67b | Dongyue Xie | 2018-10-02 | Build site. |
Rmd | eddcacf | Dongyue Xie | 2018-10-02 | revise |
html | b3c65ef | Dongyue | 2018-06-03 | chip seq data analysis |
Rmd | 7095e13 | Dongyue | 2018-06-03 | chip seq data analysis |
html | 54238d8 | Dongyue | 2018-06-03 | missing data |
Rmd | 3aa5ca5 | Dongyue | 2018-06-03 | missing data |
html | 549140f | Dongyue | 2018-05-30 | covariate iterative |
Rmd | 38e8063 | Dongyue | 2018-05-30 | covariate iterative |
html | 4b0e6c4 | Dongyue | 2018-05-25 | edit |
Rmd | d810c3e | Dongyue | 2018-05-25 | edit |
html | 568cf1a | Dongyue | 2018-05-24 | edit |
Rmd | 35cc4d6 | Dongyue | 2018-05-24 | edit |
Rmd | ec87323 | Dongyue | 2018-05-24 | edit |
html | 511adb2 | Dongyue | 2018-05-19 | wave basis |
Rmd | 0f77e70 | Dongyue | 2018-05-19 | wave basis |
html | 4b2e5d1 | Dongyue | 2018-05-17 | add known version |
Rmd | bdddcf6 | Dongyue | 2018-05-14 | edit |
Rmd | 7ee9791 | Dongyue | 2018-05-14 | edit |
html | 5f9b5c6 | Dongyue | 2018-05-09 | edit |
Rmd | fda2411 | Dongyue | 2018-05-09 | one iteration ash poisson |
html | 3a42238 | Dongyue | 2018-05-08 | correction |
Rmd | 910bc07 | Dongyue | 2018-05-08 | correction |
html | cb91cb1 | Dongyue | 2018-05-08 | add robust |
Rmd | 2e73919 | Dongyue | 2018-05-08 | add robust |
html | 44d3413 | Dongyue | 2018-05-07 | correct mu_t+E(u_t) |
Rmd | b082c1b | Dongyue | 2018-05-07 | correct mu_t+E(u_t) |
Rmd | ee44935 | Dongyue | 2018-05-06 | unknown sigma version |
html | 767077f | Dongyue | 2018-05-06 | unknown sigma version |
html | 341e471 | Dongyue | 2018-05-05 | edit |
html | c72247b | Dongyue | 2018-05-05 | edit |
Rmd | 3bd6a61 | Dongyue | 2018-05-05 | first commit |
html | ca322e8 | Dongyue | 2018-05-05 | first commit |
Rmd | b7e89a3 | DongyueXie | 2018-05-01 | Start workflowr project. |
We generalize smash (Xing and Stephens, 2016), a flexible empirical Bayes method for signal denoising, to handle non-Gaussian data, typically Poisson and binomial sequence data.
This R package contains functions used in this project.
This is a new review and summary I wrote in Sept 2018.
This section contains analyses from the very early stage of the project, when we assumed the nugget effect was known.
We initially tried using the sample mean as the initial parameter estimate to form \(y_t\), then running an iterative algorithm. The problem is that the algorithm sometimes fails to converge, especially when there are outliers (see 2). One way to address this is to set the finest-level wavelet coefficients to zero so that the outliers are largely removed (at least no longer extreme). This does seem to solve the problem, but it may be a kind of 'cheating'.
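For the Haar basis, zeroing the finest-level detail coefficients amounts to replacing each consecutive pair of observations by its average, which caps the influence of any single outlier. A minimal base-R sketch of this idea (an illustration only, not the project's actual implementation, which uses wavelet shrinkage via smash):

```r
# Zero the finest-level Haar detail coefficients of y (even length).
# One Haar step maps each pair (y[2i-1], y[2i]) to an average and a detail;
# setting the detail to zero and inverting replaces both entries of the
# pair by their mean, so a single outlier's effect is halved.
zero_finest_haar <- function(y) {
  stopifnot(length(y) %% 2 == 0)
  odd  <- y[seq(1, length(y), by = 2)]
  even <- y[seq(2, length(y), by = 2)]
  rep((odd + even) / 2, each = 2)     # inverse transform with details = 0
}

y <- c(1, 1.2, 0.9, 25, 1.1, 1.0, 0.8, 1.2)   # 25 is an outlier
zero_finest_haar(y)                   # outlier pair becomes (12.95, 12.95)
```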
Another option is to drop the iterations and use a single step, but this does not work well because \(\sum_t x_t/T\) is not a good approximation to the parameters. Finally, we decided to expand around the posterior mean from ash (see 5 below). The default choice is a grid of uniform priors on the parameter (\(\lambda\) in Poisson(\(\lambda\)) and \(p\) in Binomial(\(n,p\))).
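The expansion around a point estimate can be written out explicitly for the Poisson case: a first-order Taylor expansion of \(\log x_t\) around \(\hat\lambda_t\) gives pseudo-data \(y_t=\log\hat\lambda_t+(x_t-\hat\lambda_t)/\hat\lambda_t\) with approximate variance \(1/\hat\lambda_t\). A base-R sketch (in the project \(\hat\lambda_t\) is the ash posterior mean and the pseudo-data are smoothed with smash.gaus; the crude shrinkage default below is a hypothetical stand-in):

```r
# Gaussian pseudo-data for Poisson counts: expand log(x_t) around a point
# estimate lambda_hat. In the project lambda_hat comes from ash; the
# default here is a crude, hypothetical shrink-toward-the-mean stand-in.
poisson_pseudo_data <- function(x, lambda_hat = NULL) {
  if (is.null(lambda_hat)) {
    lambda_hat <- 0.9 * pmax(x, 0.5) + 0.1 * mean(x)   # avoid zeros
  }
  list(
    y  = log(lambda_hat) + (x - lambda_hat) / lambda_hat,  # pseudo-data
    s2 = 1 / lambda_hat                                    # approx. variance
  )
}

pd <- poisson_pseudo_data(rpois(8, lambda = 10))
# pd$y with sd sqrt(pd$s2) would then be passed to a Gaussian smoother
# such as smash.gaus
```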
In practice the nugget effect is unknown. To apply our smashgen method, we need to know \(\sigma^2+s_t^2\). My initial idea was that we could either estimate \(\sigma^2\) alone or estimate \(\sigma^2+s_t^2\) jointly. The method proposed for estimating the nugget effect works well, as shown in 1 below. Estimating \(\sigma^2\) first and then adding \(s_t^2\) gives a better variance estimate than estimating \(\sigma^2+s_t^2\) jointly (\(*\)). However, despite \((*)\), smash.gaus gives very similar estimates of \(\mu_t\) in terms of MSE. This is because smash.gaus is sensitive to the scale of the variance rather than its shape. As long as \(n\) is large enough that smash can estimate the scale of the variance well and its shape roughly, smash.gaus can still give similar mean estimates (see 2 below).
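For intuition, a standard first-difference estimator shows how a roughly constant total noise variance \(\sigma^2+s_t^2\) can be estimated when the mean is smooth. This textbook estimator is only a stand-in, not necessarily the method used in the project:

```r
# Difference-based estimate of the total noise variance sigma^2 + s_t^2,
# assumed roughly constant. If y_t = mu_t + e_t with mu smooth and the e_t
# independent with variance v, then E[(y_{t+1} - y_t)^2] is about 2v, so
# mean(diff(y)^2) / 2 estimates v.
diff_var_est <- function(y) mean(diff(y)^2) / 2

set.seed(1)
tt <- seq(0, 1, length.out = 512)
mu <- sin(2 * pi * tt)                # smooth mean
y  <- mu + rnorm(512, sd = 0.3)       # total noise sd = 0.3
sqrt(diff_var_est(y))                 # should be close to 0.3
```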
Now we are ready to apply smashgen to Poisson data with an unknown nugget effect. It is important to note that when the range of the mean function is small, the signal-to-noise ratio \(SNR=\frac{sd(\mu)}{mean[sd(noise)]}\) can be very low. For example, if the mean function is the spike function with range \([1,3]\), the standard deviation of the (log-scale) mean function is around 0.48, while the average of \(s_t\), where \(s_t^2=1/e^{\mu_t}\), is around 0.92. So even without a nugget effect, the SNR is already only about 0.52, and adding the nugget effect makes it smaller. The SNR matters here because smashgen relies on a Gaussian approximation, and a very low SNR makes it difficult to recover the true mean. This is why smashgen performs poorly when the range of the mean function is small.
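The SNR calculation above can be reproduced for any candidate mean function. A sketch using a hypothetical spike-shaped \(\lambda_t\) with range roughly \([1,3]\) (the exact spike function used in the simulations is not reproduced here):

```r
# SNR on the log scale for a Poisson mean function: the pseudo-data noise
# variance is s_t^2 = 1 / e^{mu_t} = 1 / lambda_t, so
# SNR = sd(log lambda) / mean(1 / sqrt(lambda)).
tt     <- seq(0, 1, length.out = 256)
lambda <- 1 + 2 * exp(-500 * (tt - 0.5)^2)   # hypothetical spike, range [1, 3]
mu     <- log(lambda)
snr    <- sd(mu) / mean(1 / sqrt(lambda))
snr  # well below 1: the noise sd is near 1 while the signal sd is much smaller
```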
From 3, when the scale of the mean function is small (smallest values around 0.1-0.3), smash outperforms smashgen for all the mean functions. From 4, when we increase the scale of the mean function (smallest values around 50, largest around 200), smashgen outperforms smash for almost all the functions except Doppler: smashgen cannot capture the high-frequency fluctuations at the beginning of Doppler. For the Heavisine, Parabola, Wave and time-shifted mean functions, the symmlet8 basis outperforms the Haar basis.
In 5, we compared smash and smashgen on estimating \(\mu\) and \(\log\mu\). For the step function, smashgen gives a better estimate of \(\log\mu\) but a worse estimate of \(\mu\). For Heavisine, smash and smashgen are similar, though smashgen with symmlet8 achieved lower MSE when the range of the mean function is (50,100). For Doppler, smashgen with symmlet8 achieved lower MSE when the range is (50,100). For Parabola, the two methods are similar. For Wave, smashgen with symmlet8 won. It is interesting that although smashgen gave a smaller MSE for estimating \(\log\mu\), its MSE for estimating \(\mu\) was larger, for example for the step and bump functions.
We extend smashgen to smooth probabilities. Simulations in 1 show that smashgen gives reasonable fits. When \(n\) is large and \(p\) is small in Binomial(\(n,p\)), the Poisson distribution approximates the binomial. In 2, we compared smashgen-binomial and smashgen-poi.binomial, as well as other popular smoothing methods (ti.thresh and eb.thresh). In general, smashgen-binomial outperforms the other methods. Although we expected smashgen-poi.binomial to give smaller MSE as \(n_t\) increases, this was not the case.
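The Poisson approximation underlying smashgen-poi.binomial is easy to check numerically: for large \(n\) and small \(p\), Binomial(\(n,p\)) is close to Poisson(\(np\)):

```r
# For large n and small p, Binomial(n, p) is well approximated by
# Poisson(n * p); compare the two pmfs directly.
n <- 1000; p <- 0.005                        # n * p = 5
k <- 0:20
max(abs(dbinom(k, n, p) - dpois(k, n * p)))  # tiny discrepancy
```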
Now suppose that at each \(t\), \(Y_t=X_t\beta+\mu_t+\epsilon_t\), where \(\mu\) has a smooth structure and \(\epsilon_t\sim N(0,\sigma^2_t)\). The structure of \(\mu\) cannot be explained by ordinary least squares, so it remains in the residuals \(e\); thus \(e\) consists of \(\mu\) plus noise. Applying smash.gaus to \(e\) recovers \(\mu\) and estimates \(\sigma^2\).
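The two-step idea can be sketched in a few lines: fit OLS, then smooth the residuals to recover \(\mu\). Here smooth.spline stands in for smash.gaus purely so the sketch runs with base R, and the data-generating choices are hypothetical:

```r
# Two-step fit for Y_t = X_t beta + mu_t + eps_t: OLS leaves the smooth
# part mu_t in the residuals, which are then smoothed.
# smooth.spline is a stand-in for smash.gaus; the data are simulated.
set.seed(2)
n  <- 256
tt <- seq(0, 1, length.out = n)
X  <- cbind(1, rnorm(n))              # intercept plus one covariate
beta <- c(2, 1.5)
mu <- sin(2 * pi * tt)                # smooth structure
y  <- drop(X %*% beta) + mu + rnorm(n, sd = 0.2)

fit    <- lm(y ~ X - 1)               # OLS: mu ends up in the residuals
mu_hat <- smooth.spline(tt, resid(fit))$y   # recovered smooth component
```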
We treat unevenly spaced data as missing: the missing values are set to 0 with corresponding standard error \(10^6\). The idea is that if the standard error is very large, the value of \(y\) becomes irrelevant. It does not work.
Some real-data applications of smashgen. One problem is that the sequence length is not a power of 2, so we have to either augment or cut the sequences. To augment, I reflected the two tails of the data; to cut, I retained the parts that are clearly nonzero. In conclusion, smashgen gives smoother fits in general. Two problems remain: is it over-smoothing, and how do we handle the peaks?
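One simple way to augment a sequence to a power-of-2 length is reflection padding. The sketch below reflects only the right tail (the project reflected both tails), so it is an illustration rather than the exact procedure used:

```r
# Pad a sequence to the next power-of-2 length by reflecting its right
# tail, so that wavelet methods can be applied.
reflect_pad <- function(y) {
  n <- length(y)
  m <- 2^ceiling(log2(n))
  if (m == n) return(y)
  c(y, rev(y)[seq_len(m - n)])        # reflected right tail
}

length(reflect_pad(rnorm(100)))       # 128, the next power of 2
```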
The primary role of CTCF is thought to be regulating the 3D structure of chromatin. CTCF binds together strands of DNA, thus forming chromatin loops, and anchors DNA to cellular structures like the nuclear lamina. It also defines the boundaries between active and heterochromatic DNA.
I am focusing on reading about: 1. additive models (gam, gamm, spam, gspam); 2. functional data analysis (wavelet-based functional mixed models, etc.); 3. more on exponential-family signal denoising (variance-stabilizing transforms, trend filtering).
This project is based on the ideas from Professor Matthew Stephens. Thanks to Matthew Stephens and Kushal K Dey for their great help.
This reproducible R Markdown analysis was created with workflowr 1.1.1