Last updated: 2019-01-16
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date 
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
 ✔ Environment: empty 
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
 ✔ Seed: 
set.seed(20180501) 
The command set.seed(20180501) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
 ✔ Session information: recorded 
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
 ✔ Repository version: 63ff335 
wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
Untracked files:
    Untracked:  analysis/chipexoeg.Rmd
    Untracked:  analysis/efsd.Rmd
    Untracked:  analysis/talk1011.Rmd
    Untracked:  data/chipexo_examples/
    Untracked:  data/chipseq_examples/
    Untracked:  talk.Rmd
    Untracked:  talk.pdf
Unstaged changes:
    Modified:   analysis/binomial.Rmd
    Modified:   analysis/fda.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/sigma.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes. 
| File | Version | Author | Date | Message | 
|---|---|---|---|---|
| Rmd | 63ff335 | Dongyue Xie | 2019-01-16 | wflow_publish(“analysis/r2.Rmd”) | 
| html | 1d90590 | Dongyue Xie | 2019-01-15 | Build site. | 
| Rmd | 3f6cd72 | Dongyue Xie | 2019-01-15 | wflow_publish(“analysis/r2.Rmd”) | 
| html | 265a2bb | Dongyue Xie | 2019-01-15 | Build site. | 
| Rmd | 15730f5 | Dongyue Xie | 2019-01-15 | wflow_publish(“analysis/r2.Rmd”) | 
In multiple linear regression \(y=X\beta+\epsilon\), where \(y\in R^n\), \(X\in R^{n\times p}\) whose first column is a 1 vector, and \(\epsilon\sim N(0,\sigma^2I_n)\).
Definition of ANOVA terms:
Regression sum of squares \(SSR=\Sigma_i(\hat y_i-\bar y)^2=y^T(H-\frac{1}{n}11^T)y\). df=p-1
\(MSE=\frac{SSE}{n-p}\), \(E(MSE)=\sigma^2\); \(MSR=\frac{SSR}{p-1}\), \(E(MSR)=\sigma^2+nonnegative.quantity\)
\(\frac{MSR}{MSE}\sim F_{df_1=p-1,df_2=n-p}\).
Definition of Coefficient of Multiple correlation \(R^2\):
The proportion of the total sum of squares due to regression is \(R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\); Adjusted R squared proposed by Ezekiel (1930): \(R_a^2=1-\frac{n-1}{n-p}\frac{SSE}{SST}\), mainly to correct 1. Adding a variable x to the model increases \(R^2\); 2. When all \(\beta\)s except intercept are 0, \(E(R^2)=\frac{p-1}{n-1}\)
Rewrite adjusted \(R^2\) as \(R_a^2=1-\frac{n-1}{n-p}\frac{SSE}{SST}=1-\frac{SSE/(n-p)}{SST/(n-1)}=1-\frac{\hat\sigma_\epsilon^2}{\hat\sigma^2_y}\) where \(\hat\sigma_\epsilon^2\) is the estimate of \(\sigma^2\) and \(\hat\sigma^2_y\) is the estimated variance of \(y\). My understanding of \(\sigma^2_y\): if no model assumption but just view \(y\) standalone, \(\sigma^2_y\) is the ‘population’ variance of y.
Now we have a ratio of sample variances, which fits into fash frame work: \(\tilde F=\log\frac{\hat\sigma_\epsilon^2}{\hat\sigma^2_y}\sim \log\frac{\sigma_\epsilon^2}{\sigma^2_y}\times F_{df_1=n-p,df_2=n-1}\). fash shrinks \(\log\frac{\sigma_\epsilon^2}{\sigma^2_y}\) towards zero hence \(\frac{\sigma_\epsilon^2}{\sigma^2_y}\) towards 1 and so shrinks \(R^2\) towards 0.
Example:
Fix \(n=100\) and change \(p\) from 1 to 90. Here \(p\) is the dimension excluding intercept.
library(ashr)
set.seed(12345)
p.list=1:90
n=100
R2=c()
R2a=c()
R2s=c()
for (i in 1:length(p.list)) {
  p=p.list[i]
  X=matrix(rnorm(n*(p)),n,p)
  #X.data=cbind(rep(1,n),X)
  beta=rep(0,p)
  y=X%*%beta+rnorm(n)
  datax=data.frame(X=X,y=y)
  mod=lm(y~X,datax)
  mod.sy=summary(mod)
  R2[i]=mod.sy$r.squared
  R2a[i]=mod.sy$adj.r.squared
  
  mst=sum((y-mean(y))^2)/(n-1)
  mse=sum((y-fitted(mod))^2)/(n-p)
  aa=ash(log(mse/mst),1,lik=lik_logF(df1=n-p,df2=n-1))
  R2s[i]=1-exp(aa$result$PosteriorMean)
}
plot(p.list,R2,ylim=c(-0.8,1),main='beta=0',xlab='p',ylab='')
lines(p.list,R2a,type='p',pch=2)
lines(p.list,R2s,type='p',pch=8)
legend('topleft',c('R^2','Adjusted R^2','Shrinked R^2'),pch=c(1,2,8))

Fix \(n=100\) and change \(p\) from 1 to 90.
set.seed(12345)
R2=c()
R2a=c()
R2s=c()
for (i in 1:length(p.list)) {
  p=p.list[i]
  X=matrix(rnorm(n*(p)),n,p)
  #X.data=cbind(rep(1,n),X)
  beta=rep(1,p)
  y=X%*%beta+rnorm(n)
  datax=data.frame(X=X,y=y)
  mod=lm(y~X,datax)
  mod.sy=summary(mod)
  R2[i]=mod.sy$r.squared
  R2a[i]=mod.sy$adj.r.squared
  
  mst=sum((y-mean(y))^2)/(n-1)
  mse=sum((y-fitted(mod))^2)/(n-p)
  aa=ash(log(mse/mst),1,lik=lik_logF(df1=n-p,df2=n-1))
  R2s[i]=1-exp(aa$result$PosteriorMean)
}
plot(p.list,R2,ylim=c(0.7,1),main='beta=1',xlab='p',ylab='')
lines(p.list,R2a,type='p',pch=4)
lines(p.list,R2s,type='p',pch=8)
legend('bottomright',c('R^2','Adjusted R^2','Shrinked R^2'),pch=c(1,2,8))

Define $ F^*=$, then \(F^*=\frac{SSR/(p-1)}{SSE/(n-p)}\sim F_{df_1=p-1,df_2=n-p}\) when \(\beta_1,...,\beta_{p-1}\) are 0. Otherwise, \(F^*\) follows non-central F distribution whose non-central parameter is \((X\beta)^T(H-\frac{11^T}{n})(X\beta)\).
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] ashr_2.2-7
loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18      knitr_1.20        whisker_0.3-2    
 [4] magrittr_1.5      workflowr_1.1.1   REBayes_1.3      
 [7] MASS_7.3-50       pscl_1.5.2        doParallel_1.0.14
[10] SQUAREM_2017.10-1 lattice_0.20-35   foreach_1.4.4    
[13] stringr_1.3.1     tools_3.5.1       parallel_3.5.1   
[16] grid_3.5.1        R.oo_1.22.0       git2r_0.23.0     
[19] htmltools_0.3.6   iterators_1.0.10  assertthat_0.2.0 
[22] yaml_2.2.0        rprojroot_1.3-2   digest_0.6.17    
[25] Matrix_1.2-14     codetools_0.2-15  R.utils_2.7.0    
[28] evaluate_0.11     rmarkdown_1.10    stringi_1.2.4    
[31] compiler_3.5.1    Rmosek_8.0.69     backports_1.1.2  
[34] R.methodsS3_1.7.1 truncnorm_1.0-8  
This reproducible R Markdown analysis was created with workflowr 1.1.1