Proximal Gradient Descent (PGD) is a relatively recent optimization technique for minimizing objectives with a nonsmooth part, built on the view of gradient descent as minimizing a local quadratic model of the objective (illustrated below).
PGD is also known as the forward-backward splitting method.
%%tikz --scale 1 --size 1600,1600 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {1 + (\y-1) + (\y-1)*(\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (.5,.75) -- (.5,0) node[below] {\tiny $x_{k+1}$};
\fill (1,1) circle [radius=1pt];
\fill (.5,.75) circle [radius=1pt];
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(x_k) + \nabla f(x_k)^T(x-x_k) + \frac{1}{2\alpha_k}\|x-x_k\|^2$};
Recall the subgradient method update: $$ x_{k+1} = x_k - \alpha_k g_k, \;\; g_k \in \partial f(x_k) $$
Keep track of $x_\text{best}$, since $f(x_{k+1}) \le f(x_k)$ is not guaranteed.
Convergence: with a suitably chosen stepsize, the subgradient method attains $f(x_\text{best}) - f(x^*) = O(1/\sqrt{k})$, i.e., $O(1/\epsilon^2)$ iterations for an $\epsilon$-suboptimal solution.
Suppose that the objective $f$ can be decomposed into two parts: $$ f(x) = F(x) + \Psi(x), $$where $F$ is convex and differentiable and $\Psi$ is convex but possibly nonsmooth.
Ex. convex regularization problems, e.g., the lasso with $F(w) = \frac{1}{2}\|Xw - y\|^2$ and $\Psi(w) = \lambda \|w\|_1$
Note that $f$ itself is still nonsmooth whenever $\Psi$ is, so plain gradient descent does not apply directly.
If $\Psi(x) = 0$, we can apply GD, whose update can be viewed as:
$ x_{k+1} = \argmin_{x} \; f(x_k) + \nabla f(x_k)^T (x-x_k) + \frac{1}{2\alpha_k} \|x-x_k\|^2 $
We consider a natural extension with a nontrivial $\Psi$:
$ x_{k+1} = \argmin_{x} \; F(x_k) + \nabla F(x_k)^T (x-x_k) + \frac{1}{2\alpha_k} \|x-x_k\|^2 + \Psi(x) $
Focus on: $ F(x_k) + \nabla F(x_k)^T (x-x_k) + \frac{1}{2\alpha_k} \|x-x_k\|^2 $
$\rightarrow \frac{1}{2\alpha_k} \left( \|x - x_k\|^2 + 2 \alpha_k\nabla F(x_k)^T x + \text{const} \right)$
$\rightarrow \frac{1}{2\alpha_k} \left( x^T x -2 (x_k - \alpha_k \nabla F(x_k))^T x + \text{const} \right)$
$\rightarrow \frac{1}{2\alpha_k} \| x - (x_k - \alpha_k \nabla F(x_k)) \|^2 + \text{const} $
Therefore, the generalized gradient descent updates can be written as $$ x_{k+1} = \argmin_{x} \; \left\{ \frac{1}{2\alpha_k}\| x - (x_k -\alpha_k \nabla F(x_k)) \|^2 + \Psi(x) \right\} $$
In general, we define the proximal operator
$$ \prox_{h}(z) := \argmin_{x\in\R^n} \; \left\{ \frac{1}{2} \|x - z\|^2 + h(x) \right\} $$Then the update rule becomes simply, $$ x_{k+1} = \prox_{\alpha_k \Psi} ( x_k - \alpha_k \nabla F(x_k)) $$
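To make the update concrete, here is a minimal Julia sketch of a PGD iteration; gradF and prox are hypothetical callables standing in for $\nabla F$ and the prox operator (they are not part of the lecture code):
# One proximal gradient step: x_new = prox_{alpha*Psi}(x - alpha*gradF(x)).
# gradF(x) returns the gradient of F at x; prox(z, t) returns prox_{t*Psi}(z).
pgd_step(x, gradF, prox, α) = prox(x .- α .* gradF(x), α)

# A bare-bones PGD loop with a constant stepsize.
function pgd(x0, gradF, prox, α; maxit=100)
    x = copy(x0)
    for _ in 1:maxit
        x = pgd_step(x, gradF, prox, α)
    end
    return x
end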
Consider $h(x) = I_C(x) = \begin{cases} 0 & \text{if } x \in C \\ +\infty & \text{otherwise} \end{cases}$ for a convex set $C$. What is the prox equivalent to? (It is exactly the Euclidean projection onto $C$.)
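For instance, for a box $C = \{x : l \le x_i \le u \;\forall i\}$ the projection is componentwise clamping; a tiny Julia illustration (this particular set $C$ is my own example, not from the lecture):
# prox of the indicator I_C of the box C = {x : l <= x_i <= u} is the
# Euclidean projection onto C, i.e., componentwise clamping.
proj_box(z, l, u) = clamp.(z, l, u)

proj_box([1.5, -2.0, 0.3], -1.0, 1.0)   # returns [1.0, -1.0, 0.3]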
For more properties: Proximal Algorithms, Parikh and Boyd, Foundations and Trends in Optimization, 2013
We can rewrite the expression as follows:
$$ x_{k+1} = x_k - \alpha_k \frac{x_k - \text{prox}_{\alpha_k \Psi} ( x_k - \alpha_k \nabla F(x_k))}{\alpha_k} $$And therefore, $$ x_{k+1} = x_k - \alpha_k G_{\alpha_k}^\Psi(x_k), \qquad G_{\alpha_k}^\Psi(x_k) := \frac{1}{\alpha_k}\left( x_k - \text{prox}_{\alpha_k \Psi}(x_k - \alpha_k \nabla F(x_k)) \right) $$
$G_{\alpha_k}^\Psi(x_k)$ can be seen as a generalized gradient
This can be seen as a two-step procedure:
\begin{align*} x_{k+\frac{1}{2}} & = x_k -\alpha_k \nabla F(x_k)\\ x_{k+1} & = \argmin_{x} \; \left\{ \frac{1}{2\alpha_{k+\frac{1}{2}}}\| x - x_{k+\frac{1}{2}} \|^2 + \Psi(x) \right\} \end{align*}Therefore, for some $g_\Psi(x_{k+1}) \in \partial \Psi(x_{k+1})$, $$ x_{k+1} = x_k - \underbrace{\alpha_k \nabla F(x_k)}_{\text{backward-looking}} - \underbrace{\alpha_{k+\frac{1}{2}}g_\Psi(x_{k+1})}_{\text{forward-looking}} $$
Each iteration needs to compute the prox operation:
$$ \text{prox}_{\Psi}(z) := \argmin_{x\in\R^n} \; \left\{ \frac{1}{2} \|x - z\|^2 + \Psi(x) \right\} $$We call $\Psi$ "simple" if its prox operator can be evaluated efficiently
Many interesting regularizers $\Psi$ have this property
$ \text{prox}_{\alpha_k \lambda \|x\|_1}(z) := \argmin_{x\in\R^n} \; \left\{ \frac{1}{2} \|x - z\|^2 + \alpha_k \lambda \|x\|_1 \right\} $
$0 \in x^*-z + \alpha_k \lambda g, \;\; g \in \partial \|x^*\|_1$
$x_i^* = z_i - \alpha_k \lambda g_i, \;\; g_i \in \partial |x_i^*| = \begin{cases} \{+1\} & x_i^* > 0 \\ \{-1\} & x_i^* < 0 \\ [-1,1] & x_i^*=0 \end{cases}$
$[\text{prox}_{\alpha_k \lambda \|x\|_1}(z)]_i = \begin{cases} z_i - \alpha_k \lambda & z_i > \alpha_k \lambda \\ z_i + \alpha_k \lambda & z_i < -\alpha_k \lambda\\ 0 & z_i \in \alpha_k\lambda [-1,1] \end{cases}$
This is the so-called soft-thresholding operator
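In Julia, the soft-thresholding operator is a one-liner (a minimal sketch; the lasso routines included in the experiment below presumably contain their own version):
# Soft-thresholding: [prox_{t*||.||_1}(z)]_i = sign(z_i) * max(|z_i| - t, 0).
soft_threshold(z, t) = sign.(z) .* max.(abs.(z) .- t, 0)

soft_threshold([3.0, 0.5, -1.2], 1.0)   # returns [2.0, 0.0, -0.2]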
Suppose that $F$ is $L$-strongly smooth, i.e.,
$$ \| \nabla F(x) - \nabla F(y) \|_2 \le L \|x - y\|_2, \;\; \forall x, y \in\R^n $$Then we can show for PGD using a constant stepsize $\alpha \le \frac{1}{L}$,
$$ f(x_{k}) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\alpha k} $$Proof sketch: since $\nabla F$ is $L$-Lipschitz continuous,
$\Rightarrow f(y) \le F(x) + \nabla F(x)^T (y-x) + \frac{L}{2}\|y-x\|^2 + \Psi(y), \;\; \forall x, y$
Using the update rule: $x_{k+1} = x_k - \alpha_k G_{\alpha_k}^\Psi(x_k)$,
$f(x_{k+1}) \le F(x_k) -\alpha_k \nabla F(x_k)^T G_{\alpha_k}^\Psi(x_k) + \alpha_k^2 \frac{L}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2 + \Psi(x_{k+1}) \qquad (*)$
On the other hand, (recall $u = \text{prox}_h(z) \;\; \Leftrightarrow \;\; z - u \in \partial h(u)$)
$x_{k+1} = x_k - \alpha_k G_{\alpha_k}^\Psi(x_k) = \text{prox}_{\alpha_k \Psi}(x_k - \alpha_k \nabla F(x_k))$
$\Rightarrow G_{\alpha_k}^\Psi(x_k) - \nabla F(x_k) \in \partial \Psi(x_k - \alpha_k G_{\alpha_k}^\Psi(x_k)) = \partial \Psi(x_{k+1})$
From the definition of a subgradient,
$ \Psi(x^*) \ge \Psi(x_{k+1}) + (G_{\alpha_k}^\Psi(x_k) - \nabla F(x_k))^T(x^* - x_{k+1}) \qquad (**) $
Now plug (**), together with the convexity of $F$, $$ F(x^*) \ge F(x_k) + \nabla F(x_k)^T (x^*-x_k) \qquad (***) $$into (*):
\begin{align*} f(x_{k+1}) &\le f(x^*) - \nabla F(x_k)^T (x^*-x_k)\\ & -(G_{\alpha_k}^\Psi (x_k)-\nabla F(x_k))^T(x^* - x_k + \alpha_k G_{\alpha_k}^\Psi (x_k)) \\ &- \alpha_k \nabla F(x_k)^T G_{\alpha_k}^\Psi(x_k) + \alpha_k^2 \frac{L}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2 \end{align*}which simplifies to $$f(x_{k+1}) \le f(x^*) + G_{\alpha_k}^\Psi (x_k)^T (x_k - x^*) + \alpha_k \left( \frac{\alpha_kL}{2} - 1\right) \|G_{\alpha_k}^\Psi (x_k)\|^2$$
Choosing $\alpha_k = \alpha \le \frac{1}{L}$,
$$f(x_{k+1}) \le f(x^*) + G_{\alpha_k}^\Psi (x_k)^T (x_k - x^*) -\frac{\alpha_k}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2$$Using the fact that $G_{\alpha_k}^\Psi (x_k) = \frac{1}{\alpha_k} (x_k - x_{k+1})$,
$$f(x_{k+1}) \le f(x^*) + \frac{1}{2\alpha_k} \left( \|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2 \right)$$Therefore,
$$ \sum_{i=1}^{k} (f(x_{i}) - f(x^*)) \le \frac{1}{2\alpha} \left( \|x_0 - x^*\|^2 - \|x_{k} - x^*\|^2 \right) \le \frac{1}{2\alpha} \|x_0 - x^*\|^2 $$If $f(x_k)$ is nonincreasing, then
$$ k (f(x_k) - f(x^*)) \le \sum_{i=1}^{k} (f(x_{i}) - f(x^*)) \le \frac{1}{2\alpha} \|x_0 - x^*\|^2 $$Therefore,
$$ f(x_{k}) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\alpha k} $$To verify that $f(x_k)$ is indeed nonincreasing, repeat the analysis, this time with $x_k$ in place of $x^*$ in (**):
$f(x_{k+1}) \le F(x_k) -\alpha_k \nabla F(x_k)^T G_{\alpha_k}^\Psi(x_k) + \alpha_k^2 \frac{L}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2 + \Psi(x_{k+1}) \qquad (*)$
$\Psi(x_k) \ge \Psi(x_{k+1}) + (G_{\alpha_k}^\Psi(x_k) - \nabla F(x_k))^T(x_k - x_{k+1}) \qquad (**)$Plugging this into (*),
\begin{align*} f(x_{k+1}) &\le f(x_k) -(G_{\alpha_k}^\Psi (x_k)-\nabla F(x_k))^T(\alpha_k G_{\alpha_k}^\Psi (x_k)) \\ &- \alpha_k \nabla F(x_k)^T G_{\alpha_k}^\Psi(x_k) + \alpha_k^2 \frac{L}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2 \end{align*}which simplifies to
\begin{align*} f(x_{k+1}) & \le f(x_k) + \alpha_k \left( \frac{\alpha_kL}{2} - 1\right) \|G_{\alpha_k}^\Psi (x_k)\|^2\\ &\le f(x_k) \qquad \text{if $\alpha_k \le \frac{1}{L}$} \end{align*}Backtracking line search: choose $\beta \in (0,1)$. In each iteration $k$, search for a stepsize $\alpha$ (starting from $\alpha=1$) satisfying:
$$ f(x_k - \alpha G_{\alpha}^\Psi(x_k)) \le f(x_k) - \frac{\alpha}{2} \|G_\alpha^\Psi(x_k)\|^2 $$shrinking $\alpha \leftarrow \beta \alpha$ until the condition holds.
Then we have
$$ f(x_{k}) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\alpha_{\min} k}, \qquad \alpha_{\min} = \min\{1, \beta / L\} $$
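A sketch of one PGD iteration with this backtracking rule, reusing the hypothetical f (full objective), gradF, and prox callables from the earlier sketch:
# One PGD step with backtracking: start at alpha = 1 and shrink alpha <- beta*alpha
# until f(x_new) <= f(x) - (alpha/2)*||G_alpha(x)||^2 holds.
function pgd_step_backtracking(x, f, gradF, prox, β; α=1.0)
    g = gradF(x)
    while true
        xnew = prox(x .- α .* g, α)      # candidate update
        G = (x .- xnew) ./ α             # generalized gradient G_alpha^Psi(x)
        if f(xnew) <= f(x) - (α / 2) * sum(abs2, G)
            return xnew, α               # accept this stepsize
        end
        α *= β                           # backtrack
    end
end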
Accelerated gradient descent (AGD) is a technique to achieve the optimal rate $O\left(\frac{1}{k^2}\right)$ among first-order methods, based on ideas by Yurii Nesterov.
The accelerated proximal gradient method below is an extension of Nesterov's first (1983) scheme to decomposable objective functions.
$y_1 = x_0 \in \R^n$, $t_1 = 1$
For $k=1,2,\dots$:
\begin{align*} x_k &= \prox_{\alpha_k\Psi} (y_k - \alpha_k\nabla F(y_k))\\ t_{k+1} &= \frac{1+\sqrt{1+4t_k^2}}{2}\\ y_{k+1} &= x_k + \left( \frac{t_k-1}{t_{k+1}} \right) (x_k - x_{k-1}) \end{align*}We choose $\alpha_k = 1/L$ as in the proximal GD.
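The same iterations in a minimal Julia sketch (again with the hypothetical gradF and prox callables; the lassoFISTA routine used in the experiment below is the course's own, separate implementation):
# Accelerated proximal gradient (FISTA-style) with constant stepsize alpha = 1/L.
function apg(x0, gradF, prox, α; maxit=100)
    x = copy(x0)    # x_{k-1}, initialized to x_0
    y = copy(x0)    # y_1 = x_0
    t = 1.0         # t_1 = 1
    for _ in 1:maxit
        xnew  = prox(y .- α .* gradF(y), α)              # x_k
        tnext = (1 + sqrt(1 + 4t^2)) / 2                 # t_{k+1}
        y     = xnew .+ ((t - 1) / tnext) .* (xnew .- x) # y_{k+1}
        x, t  = xnew, tnext
    end
    return x
end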
%%tikz --scale 1 --size 1100,1100 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {1 + (\y-1) + (\y-1)*(\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (.5,.75) -- (.5,0) node[below] {\tiny $x_{k+1}$};
\fill (1,1) circle [radius=1pt];
\fill (.5,.75) circle [radius=1pt];
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(x_k) + \nabla f(x_k)^T(x-x_k) + \frac{1}{2\alpha_k}\|x-x_k\|^2$};
%%tikz --scale 1 --size 1100,1100 -f png
\newcommand{\tikzcircle}[2][red,fill=red]{\tikz[baseline=-0.5ex]\draw[#1,radius=#2] (0,0) circle ;}%
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {.5+.5*.7^2 + .7*(\y-.7) + (\y-.7)*(\y-.7)});
\draw[thick, ->] (1.5, 0) -- (.7,0);
\fill (1.5,.5+.5*1.5^2) circle [radius=1pt]; \draw[dotted] (1.5, .5+.5*1.5^2) -- (1.5, 0) node[below] {\tiny $x_{k-1}$};
\fill (1,1) circle [radius=1pt]; \draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw [red,fill=red] (.7,.5+.5*.7^2) circle [radius=1pt]; \draw[dotted] (.7,.5+.5*.7^2) -- (.7,0) node[above, red] {\tiny $y_{k}$};
\fill (.35, .5+.35^2) circle [radius=1pt]; \draw[dotted] (.35, .5+.35^2) -- (.35,0) node[below] {\tiny $x_{k+1}$};
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(y_k) + \nabla f(y_k)^T(x-y_k) + \frac{1}{2\alpha_k}\|x-y_k\|^2$};
\draw[dashed,->] (1.1, 0.1) -- (3,2) node[right] {\scriptsize $x_k + \frac{t_k-1}{t_{k+1}} (x_k-x_{k-1})$};
AGD with the constant stepsize $\alpha_k = 1/L$ gives
$$ f(x_k) - f(x^*) \le \frac{2\|x_0 - x^*\|^2}{\alpha (k+1)^2} $$This achieves the optimal rate for first-order optimization methods.
To get an $\epsilon$-suboptimal solution, we need only $O(1/\sqrt{\epsilon})$ iterations!
Any property of PGD that is missing in AGD? (AGD is not a descent method: $f(x_k)$ is no longer guaranteed to be nonincreasing.)
include("../_julia/julia/lasso.ISTA.jl")
include("../_julia/julia/lasso.FISTA.jl")
using Distributions
using PyPlot
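# Generate a random least-squares instance: X is m-by-n Gaussian, wstar the true weights, and y = X * wstar (noiseless).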
m = 100
n = 100
lam = .001
X = rand(Normal(0, 2), (m,n))
wstar = rand(Normal(1.2, 2.0), (n,1))
y = X * wstar;
maxiter = 1000;
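# Run PGD (ISTA) and two accelerated (FISTA) variants on the lasso objective, recording objective values and gradient norms per iteration.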
ista = lassoISTA(X, y, step="const", lam=lam, maxit=maxiter, repf=true, repg=true);
simple = lassoFISTA(X, y, step="simple", lam=lam, maxit=maxiter, repf=true, repg=true);
fista = lassoFISTA(X, y, step="const", lam=lam, maxit=maxiter, repf=true, repg=true);
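# Plot objective value (left) and gradient infinity norm (right) versus iteration.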
subplot(121)
PyPlot.semilogy(1:maxiter, ista[2], color="blue");
PyPlot.semilogy(1:maxiter, simple[2], color="green");
PyPlot.semilogy(1:maxiter, fista[2], color="red", linestyle="dashed");
xlabel("Iterations")
ylabel("Objective function value")
ax = gca()
ax[:set_ylim]([0, 100])
ax[:legend](("PGD", "AGD (simple)", "AGD (fista)"), loc=1)
subplot(122)
PyPlot.semilogy(1:maxiter, ista[3], color="blue");
PyPlot.semilogy(1:maxiter, simple[3], color="green");
PyPlot.semilogy(1:maxiter, fista[3], color="red", linestyle="dashed");
xlabel("Iterations")
ylabel("Gradient inf norm")
ax = gca()
ax[:set_ylim]([0, 10])
ax[:legend](("PGD", "AGD (simple)", "AGD (fista)"), loc=1)
tight_layout()
L = 1481.5710554161046
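The value printed above is the Lipschitz constant $L$ of $\nabla F$ used to set the stepsize. For the least-squares loss $F(w) = \frac{1}{2}\|Xw - y\|^2$ it equals $\lambda_{\max}(X^T X)$, which could be computed for instance as follows (assuming that scaling of the loss; the included lasso routines may compute it differently):
using LinearAlgebra
# For F(w) = (1/2)||Xw - y||^2, gradF is L-Lipschitz with L = lambda_max(X'X) = ||X||_2^2.
L = opnorm(X)^2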
Ex 11.1 (Subgradient characterization) Suppose that $h:\R^n\to \R$ is a convex function. Show that the following is true:
$$ u = \text{prox}_h(z) \;\; \Leftrightarrow \;\; z - u \in \partial h(u), $$and therefore $$ h(z) - h(u) \ge \|z-u\|_2^2 $$
Ex 11.2 (Optimality Condition) Suppose that we solve the following problem:
$$ \min_{x\in\R^n} \;\; f(x) := F(x) + \Psi(x) $$where $F:\R^n\to\R$ and $\Psi:\R^n\to\R$ are proper convex functions, and $F$ is continuously differentiable. In this case, the first-order necessary and sufficient condition for convex optimization applies in a general form, i.e.,
$$ x^* \text{ is a global minimizer} \;\; \Leftrightarrow \;\; 0 \in \partial f(x^*) $$
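For the composite objective above, $\partial f(x) = \nabla F(x) + \partial \Psi(x)$, so the condition can be unpacked with the prox characterization from Ex 11.1: for any $\alpha > 0$,
\begin{align*} 0 \in \nabla F(x^*) + \partial \Psi(x^*) \;\; &\Leftrightarrow \;\; (x^* - \alpha \nabla F(x^*)) - x^* \in \partial (\alpha \Psi)(x^*) \\ &\Leftrightarrow \;\; x^* = \text{prox}_{\alpha \Psi}(x^* - \alpha \nabla F(x^*)), \end{align*}i.e., the global minimizers are exactly the fixed points of the PGD update.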