Rmosek: Regularization

Last updated: 2017-05-09

Code version: b3932c7
The \(w\) optimization problem we hope to solve has two constraints. If we do not impose them, chances are the optimization will be unstable, especially when we don't know \(L\) for sure beforehand. However, these two constraints are hard to convert directly and strictly into convex, or even tractable, forms.
\[ \begin{array}{rl} \min\limits_{w} & -\sum\limits_{j = 1}^n\log\left(\sum\limits_{l=1}^Lw_l\left(\sum\limits_{k = 1}^K\hat\pi_k f_{jkl}\right) + \sum\limits_{k = 1}^K\hat\pi_kf_{jk0}\right) + \sum\limits_{l = 1}^L\lambda_l^w \phi \left(\left|w_l\right|\right) \\ \text{subject to} & \sum\limits_{l=1}^L w_l \varphi^{(l)}\left(z\right) + \varphi\left(z\right) \geq 0, \forall z\in \mathbb{R}\\ & w_l \text{ decay reasonably fast.} \end{array} \]
One observation is that without the constraints, the optimization becomes unstable as the \(\hat w_l\) grow uncontrollably large. Therefore, a natural replacement for these two constraints is to bound, penalize, or regularize \(w\), to prevent the estimates from getting too large.
On the other hand, as indicated by the theory and by the examples of good fits by Gaussian derivatives, \(w_l\) gets smaller as \(l\) increases. Moreover, oftentimes really small \(w_l\) can still make a difference and are thus indispensable. Therefore, although we need to stop \(\hat w\) from getting too large, we certainly don't want to shrink them unnecessarily and unwarrantedly to zero.
Ideally, the goal of the regularization term \(\sum\limits_{l = 1}^L\lambda_l^w \phi \left(\left|w_l\right|\right)\) should be to keep \(\hat w\) from blowing up while leaving small but genuine \(w_l\) essentially untouched.
Remember that, in theory, if we write the random empirical density of correlated, marginally \(N\left(0, 1\right)\) random samples as
\[ f_0\left(z\right) = \varphi\left(z\right) + \sum\limits_{l = 1}^\infty W_l\varphi^{(l)}\left(z\right) \ , \] then
\[ \text{var}\left(W_l\right) = \frac{\alpha_l}{l!} \ , \] where
\[ \alpha_l = \frac{1}{\frac{n\left(n - 1\right)}{2}}\sum\limits_{i < j}\rho_{ij}^l := \bar{\rho_{ij}^l} \ . \]
Therefore, \(w_l\) should naturally decay roughly on the order of
\[ \left|w_l\right| \leadsto \sqrt{\text{var}\left(W_l\right)} = \frac{\sqrt{\bar{\rho_{ij}^l}}}{\sqrt{l!}} \ . \]
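To get a feel for how fast this decay is, here is a small numerical illustration; the correlation matrix rho_mat below is simulated purely for the example and is not part of the analysis.

```r
## Illustration: given a (simulated) correlation matrix, compute
## alpha_l = mean of rho_ij^l over pairs i < j, and the implied decay
## sqrt(var(W_l)) = sqrt(alpha_l / l!) for l = 1, ..., L.
set.seed(777)
n <- 100
B <- matrix(rnorm(n * 5), n, 5)
rho_mat <- cov2cor(tcrossprod(B) + diag(n))  # an arbitrary valid correlation matrix
rho_pairs <- rho_mat[lower.tri(rho_mat)]     # rho_ij for i < j
L <- 9
alpha <- sapply(1:L, function(l) mean(rho_pairs^l))
decay <- sqrt(pmax(alpha, 0) / factorial(1:L))  # suggested order of |w_l|
round(decay, 6)
```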
The order of decay suggests that we should work with normalized coefficients, so that
\[ \left|w_l^n\right| = \sqrt{l!}\left|w_l\right| \leadsto \sqrt{\bar{\rho_{ij}^l}} \ . \] This information on the order of magnitude of the normalized \(w_l^n\) provides a hint for determining \(\lambda_l^w\), although a few considerations complicate the choice.
Odd orders of Gaussian derivatives are generally associated with the mean shift and skewness of \(f_0\), so they are generally very small; but when they are indeed not zero, they are important, however small they are.
Meanwhile, when \(l\) is odd, \(\bar{\rho_{ij}^l}\) can be very small and not on the order of \(\rho^l\) for some \(\rho \in \left(0, 1\right)\), so it is very difficult to come up with a good \(\lambda_l^w\) when \(l\) is odd.
Therefore, it might be better to leave the odd orders alone and regularize the even orders only.
Generally speaking, \(l_1\) regularization imposes sparsity by shrinking small estimates exactly to zero, which is both good and bad for our case, whereas \(l_2\) regularization penalizes large estimates severely but doesn't force exact sparsity, which can also be good or bad.
We’ll implement both and see what happens.
\(\lambda_l^w\) is determined to help ensure
\[ \left|w_l^n\right| = \sqrt{l!}\left|w_l\right| \leadsto \sqrt{\bar{\rho_{ij}^l}} \ . \]
Given \(\rho \in \left(0, 1\right)\), this suggests \(\lambda_l^n \sim \lambda / \sqrt{\rho^l}\) for \(l_1\) regularization and \(\lambda_l^n \sim \lambda / \rho^l\) for \(l_2\) regularization, so that each penalty term stays on roughly the same order.
So far, \(L = 9\) has been able to handle all the examples we've tried. Without constraints or regularization, \(L = 9\) is usually too large for cases where, say, \(L = 4\) is enough, and a larger-than-necessary \(L\) can lead to numerical instability.
Hence, we’ll experiment with \(L = 12\), assuming it’s enough for all the correlation induced distortions in practice, and assuming the regularization could prevent the optimization from going crazy.
The \(w\) optimization problem becomes
\[ \begin{array}{rl} \min\limits_{w, g} & -\sum\limits_{j = 1}^n \log\left(g_j\right) + \text{Regularization} \\ \text{subject to} & Aw + a = g \\ & g \geq 0\ , \end{array} \] where \(A_{n \times L} = \left[a_1,\ldots, a_L\right]\) and \(a\) are computed with normalized Gaussian derivatives and Hermite polynomials.
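As a sketch of the ingredients (not the full construction, which also folds in \(\hat\pi_k\) and \(f_{jkl}\)), the normalized Gaussian derivatives can be evaluated through the probabilists' Hermite polynomials, using \(\varphi^{(l)}(z) = (-1)^l He_l(z)\varphi(z)\); the function name gd.normalized is only illustrative.

```r
## Normalized Gaussian derivatives phi^{(l)}(z) / sqrt(l!) via the Hermite
## recursion He_l(z) = z * He_{l-1}(z) - (l - 1) * He_{l-2}(z).
gd.normalized <- function(z, L) {
  He <- matrix(0, length(z), L + 1)  # columns: He_0, ..., He_L evaluated at z
  He[, 1] <- 1
  if (L >= 1) He[, 2] <- z
  if (L >= 2) for (l in 2:L) He[, l + 1] <- z * He[, l] - (l - 1) * He[, l - 1]
  phi <- dnorm(z)
  sapply(1:L, function(l) (-1)^l * He[, l + 1] * phi / sqrt(factorial(l)))
}
z <- seq(-4, 4, by = 0.5)          # placeholder z scores
A <- gd.normalized(z, L = 9)       # n x L matrix of normalized phi^{(l)}(z_j)
a <- dnorm(z)                      # the l = 0 term, phi(z_j)
```

The \(\sqrt{l!}\) normalization here matches the normalized coefficients \(w_l^n\) discussed above.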
Let \(\lambda > 0\) and \(\rho \in \left(0, 1\right)\); the regularization term takes one of the two forms below.
The \(l_1\) form is
\[ \begin{array}{l} \sum\limits_{l = 1}^L\lambda_l^{w^s}\left|w_l^s\right| \ ,\\ \lambda_l^{w^s} = \begin{cases} 0, & l \text{ is odd;} \\ \lambda / \rho^{l/2}, & l \text{ is even.} \end{cases} \end{array} \]
The \(l_2\) form is
\[ \begin{array}{l} \sum\limits_{l = 1}^L \lambda_l^{w^s}{w_l^s}^2 \ ,\\ \lambda_l^{w^s} = \begin{cases} 0, & l \text{ is odd;} \\ \lambda / \rho^{l}, & l \text{ is even.} \end{cases} \end{array} \]
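For concreteness, a small helper along these lines could build the weight vector \(\lambda_l^{w^s}\) for either form; the name penalty.weights and its interface are only illustrative, not part of the implementation used here.

```r
## Penalty weights: zero for odd l; lambda / rho^(l/2) (l_1 case) or
## lambda / rho^l (l_2 case) for even l.
penalty.weights <- function(L, lambda, rho, type = c("l1", "l2")) {
  type <- match.arg(type)
  l <- 1:L
  w <- numeric(L)
  even <- (l %% 2 == 0)
  w[even] <- if (type == "l1") lambda / rho^(l[even] / 2) else lambda / rho^(l[even])
  w
}
penalty.weights(L = 12, lambda = 0.1, rho = 0.9, type = "l1")
```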
With \(l_1\) regularization, the primal form is
\[ \begin{array}{rl} \min\limits_{w, g} & -\sum\limits_{j = 1}^n \log\left(g_j\right) + \sum\limits_{l = 1}^L\lambda_l^{w^s}\left|w_l^s\right| \\ \text{subject to} & Aw + a = g \\ & g \geq 0\ . \end{array} \] The dual form is
\[ \begin{array}{rl} \min\limits_{v} & -\sum\limits_{j = 1}^n \log\left(v_j\right) + a^Tv - n \\ \text{subject to} & -\lambda_l^{w^s} \leq a_l^Tv \leq \lambda_l^{w^s}, \ l = 1, \ldots, L \\ & v \geq 0\ . \end{array} \]
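This dual is close in form to the Kiefer-Wolfowitz dual that REBayes::KWDual solves with Rmosek, so one possible sketch uses Rmosek's separable convex (scopt) interface for the \(-\log\) terms. This is an assumption-laden outline (the field names follow the MOSEK 7/8-era Rmosek interface, whose availability depends on the installed version, and rmosek.dual.l1 is a made-up wrapper name), not the code actually run here.

```r
## Sketch: min_v -sum(log(v)) + a'v  s.t.  -lambda_l <= a_l'v <= lambda_l, v >= 0,
## via Rmosek's separable convex (scopt) interface (assumed available).
library(Rmosek)
rmosek.dual.l1 <- function(A, a, lambda.w) {      # hypothetical wrapper name
  n <- nrow(A)
  P <- list(sense = "min")
  P$c <- a                                        # linear part a'v; the constant -n is dropped
  P$A <- Matrix::Matrix(t(A), sparse = TRUE)      # L x n constraint matrix A'
  P$bc <- rbind(blc = -lambda.w, buc = lambda.w)  # -lambda_l <= a_l'v <= lambda_l
  P$bx <- rbind(blx = rep(0, n), bux = rep(Inf, n))
  opro <- matrix(list(), nrow = 5, ncol = n,
                 dimnames = list(c("type", "j", "f", "g", "h"), NULL))
  opro["type", ] <- as.list(rep("log", n))        # each term: f * log(g * v_j + h)
  opro["j", ] <- as.list(1:n)
  opro["f", ] <- as.list(rep(-1, n))              # gives -log(v_j)
  opro["g", ] <- as.list(rep(1, n))
  opro["h", ] <- as.list(rep(0, n))
  P$scopt <- list(opro = opro)
  res <- mosek(P, opts = list(verbose = 0))
  res$sol$itr$xx                                  # dual solution v
}
```

Note that for odd \(l\) the weight is zero, so the corresponding constraints pin \(a_l^Tv\) to zero, matching the unpenalized coordinates.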
With \(l_2\) regularization, the primal form is
\[ \begin{array}{rl} \min\limits_{w, g} & -\sum\limits_{j = 1}^n \log\left(g_j\right) + \sum\limits_{l = 1}^L\lambda_l^{w^s}{w_l^s}^2 \\ \text{subject to} & Aw + a = g \\ & g \geq 0\ . \end{array} \] The dual form is
\[ \begin{array}{rl} \min\limits_{v} & -\sum\limits_{j = 1}^n \log\left(v_j\right) + a^Tv - n + \frac14 v^T \left(A \Lambda^{-1} A^T\right)v \\ \text{subject to} & a_j^Tv = 0 \ \text{ if }\lambda_j = 0 \ ,\\ & v \geq 0\ , \end{array} \] where \(\Lambda = \begin{bmatrix}\lambda_1 & & \\ & \ddots & \\ & & \lambda_L \end{bmatrix}\), and \({\Lambda^{-1}}_{jj} = 0\) when \(\lambda_j = 0\).
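The quadratic piece of this dual can be assembled directly. The helper below (again an illustrative name, not the code used here) just evaluates the dual objective under the convention that \({\Lambda^{-1}}_{jj} = 0\) whenever \(\lambda_j = 0\).

```r
## Evaluate -sum(log(v)) + a'v - n + v' (A Lambda^{-1} A') v / 4,
## with (Lambda^{-1})_{jj} = 0 for the unpenalized (odd) orders.
l2.dual.objective <- function(v, A, a, lambda.w) {   # hypothetical helper
  lam.inv <- ifelse(lambda.w == 0, 0, 1 / lambda.w)  # pseudo-inverse of Lambda
  Q <- A %*% (t(A) * lam.inv)                        # A Lambda^{-1} A'
  -sum(log(v)) + sum(a * v) - length(v) + drop(t(v) %*% Q %*% v) / 4
}
```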
Let’s start with \(\rho = 0.9\) and \(\lambda = 0.1\).
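Tying the hypothetical pieces above together for this first setting:

```r
rho <- 0.9
lambda <- 0.1
L <- 12
lam.l1 <- penalty.weights(L, lambda, rho, type = "l1")  # weights for the l_1 form
lam.l2 <- penalty.weights(L, lambda, rho, type = "l2")  # weights for the l_2 form
## v.hat <- rmosek.dual.l1(A, a, lam.l1)  # requires A, a from the data and a working MOSEK/Rmosek
```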
sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.4
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] backports_1.0.5 magrittr_1.5 rprojroot_1.2 tools_3.3.3
[5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.10 stringi_1.1.2
[9] rmarkdown_1.3 knitr_1.15.1 git2r_0.18.0 stringr_1.2.0
[13] digest_0.6.11 evaluate_0.10