Update:
\begin{align*} x_{k+1} &= \argmin_{x} \left\{ F(x_k) + \nabla F(x_k)^T (x-x_k) + \frac{1}{2\alpha_k} \|x-x_k\|^2 + \Psi(x) \right\}\\ &= \text{prox}_{\alpha_k \Psi} ( x_k - \alpha_k \nabla F(x_k)) \end{align*}Then with stepsize $\alpha_k = \alpha \le \frac{1}{L}$, we have monotone ($f(x_{k+1}) \le f(x_k)$) convergence with rate:
$$ f(x_{k}) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\alpha k} $$A technique to achieve the optimal rate $O\left(\frac{1}{k^2}\right)$ among first-order methods
Ideas by Yurii Nesterov
An extension of Nesterov's first (1983) idea to decomposable objective functions
$y_1 = x_0 \in \R^n$, $t_1 = 1$
For $k=1,2,\dots$:
\begin{align*} x_k &= \prox_{\alpha_k\Psi} (y_k - \alpha_k\nabla F(y_k))\\ t_{k+1} &= \frac{1+\sqrt{1+4t_k^2}}{2}\\ y_{k+1} &= x_k + \left( \frac{t_k-1}{t_{k+1}} \right) (x_k - x_{k-1}) \end{align*}We choose $\alpha_k = 1/L$ as in the proximal GD.
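As a concrete illustration (not the author's code), here is a minimal NumPy sketch of this iteration for the lasso, where $F(x) = \frac12\|Ax-b\|^2$ and $\Psi(x) = \lambda\|x\|_1$, so that $\prox_{\alpha\Psi}$ is componentwise soft-thresholding; all names and problem data below are made up for illustration:

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*||.||_1: componentwise shrinkage toward zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(A, b, lam, iters=500):
    # F(x) = 0.5*||Ax - b||^2, Psi(x) = lam*||x||_1
    L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of grad F (spectral norm squared)
    alpha = 1.0 / L                     # constant stepsize alpha_k = 1/L
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()                   # y_1 = x_0
    t = 1.0                             # t_1 = 1
    for _ in range(iters):
        grad = A.T @ (A @ y - b)
        x = soft_threshold(y - alpha * grad, alpha * lam)          # prox step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0          # t_{k+1}
        y = x + ((t - 1.0) / t_next) * (x - x_prev)                # momentum step
        x_prev, t = x, t_next
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
xstar = np.zeros(20); xstar[:3] = [2.0, -1.5, 1.0]
b = A @ xstar                           # noiseless observations
w = fista(A, b, lam=0.01)
```

With small $\lambda$ and noiseless data, the iterate `w` ends up close to the sparse `xstar`.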
%%tikz --scale 1 --size 1100,1100 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {1 + (\y-1) + (\y-1)*(\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (.5,.75) -- (.5,0) node[below] {\tiny $x_{k+1}$};
\fill (1,1) circle [radius=1pt];
\fill (.5,.75) circle [radius=1pt];
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(x_k) + \nabla f(x_k)^T(x-x_k) + \frac{1}{2\alpha_k}\|x-x_k\|^2$};
%%tikz --scale 1 --size 1100,1100 -f png
\newcommand{\tikzcircle}[2][red,fill=red]{\tikz[baseline=-0.5ex]\draw[#1,radius=#2] (0,0) circle ;}%
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {.5+.5*.7^2 + .7*(\y-.7) + (\y-.7)*(\y-.7)});
\draw[thick, ->] (1.5, 0) -- (.7,0);
\fill (1.5,.5+.5*1.5^2) circle [radius=1pt]; \draw[dotted] (1.5, .5+.5*1.5^2) -- (1.5, 0) node[below] {\tiny $x_{k-1}$};
\fill (1,1) circle [radius=1pt]; \draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw [red,fill=red] (.7,.5+.5*.7^2) circle [radius=1pt]; \draw[dotted] (.7,.5+.5*.7^2) -- (.7,0) node[above, red] {\tiny $y_{k}$};
\fill (.35, .5+.35^2) circle [radius=1pt]; \draw[dotted] (.35, .5+.35^2) -- (.35,0) node[below] {\tiny $x_{k+1}$};
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(y_k) + \nabla f(y_k)^T(x-y_k) + \frac{1}{2\alpha_k}\|x-y_k\|^2$};
\draw[dashed,->] (1.1, 0.1) -- (3,2) node[right] {\scriptsize $x_k + \left(\tfrac{t_k-1}{t_{k+1}}\right) (x_k-x_{k-1})$};
AGD with the constant stepsize $\alpha_k = 1/L$ gives
$$ f(x_k) - f(x^*) \le \frac{2\|x_0 - x^*\|^2}{\alpha (k+1)^2} $$This achieves the optimal rate for first-order optimization methods
To get an $\epsilon$ suboptimal solution, we need $O(1/\sqrt{\epsilon})$ iterations!
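To make the count explicit: setting $\frac{2\|x_0-x^*\|^2}{\alpha(k+1)^2} \le \epsilon$ and solving for $k$ gives $k \ge \sqrt{2\|x_0-x^*\|^2/(\alpha\epsilon)} - 1$. A quick check with illustrative values $\|x_0-x^*\| = 1$, $\alpha = 1$:

```python
import math

def iters_needed(eps, R=1.0, alpha=1.0):
    # Smallest k with 2*R^2 / (alpha*(k+1)^2) <= eps,
    # i.e. k >= sqrt(2*R^2/(alpha*eps)) - 1
    return math.ceil(math.sqrt(2.0 * R * R / (alpha * eps)) - 1.0)

print(iters_needed(1e-4))  # 141
print(iters_needed(1e-6))  # 1414: 100x smaller eps costs only 10x more iterations
```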
$\Rightarrow$
\begin{align*} y_k &= (1-\theta_k) x_k + \theta_k u_k \\ x_{k+1} &= \text{prox}_{\alpha_k \Psi} ( y_k - \alpha_k \nabla F(y_k)) \\ u_{k+1} &= x_k + \frac{1}{\theta_k} (x_{k+1} - x_k) \end{align*}with $u_0=x_0$ and $\theta_k = 2/(k+2)$
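A runnable sketch of this $(y, u, \theta)$ form on a smooth quadratic with $\Psi = 0$ (so the prox is the identity map); the problem data here are invented purely for illustration:

```python
import numpy as np

# Minimize f(x) = 0.5*x^T Q x - c^T x with Psi = 0, so prox_{alpha*Psi} = identity
rng = np.random.default_rng(1)
M = rng.standard_normal((30, 10))
Q = M.T @ M + np.eye(10)          # positive definite Hessian
c = rng.standard_normal(10)
xstar = np.linalg.solve(Q, c)     # exact minimizer, for reference
L = np.linalg.eigvalsh(Q).max()   # Lipschitz constant of grad f
alpha = 1.0 / L

x = np.zeros(10)
u = x.copy()                      # u_0 = x_0
for k in range(200):
    theta = 2.0 / (k + 2.0)                   # theta_k = 2/(k+2)
    y = (1.0 - theta) * x + theta * u         # y_k
    x_next = y - alpha * (Q @ y - c)          # prox step with Psi = 0
    u = x + (x_next - x) / theta              # u_{k+1} = x_k + (x_{k+1}-x_k)/theta_k
    x = x_next

print(np.linalg.norm(x - xstar))  # small after 200 accelerated steps
```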
For convenience, we write:
\begin{align*} y_k &= (1-\theta_k) x_k + \theta_k u_k \\ x_{k+1} &= \text{prox}_{\alpha_k \Psi} ( y_k - \alpha_k \nabla F(y_k)) \\ u_{k+1} &= x_k + \frac{1}{\theta_k} (x_{k+1} - x_k) \end{align*}$\Rightarrow$
\begin{align*} y &= (1-\theta) x + \theta u \\ x^+ &= \text{prox}_{\alpha \Psi} ( y - \alpha \nabla F(y)) \\ u^+ &= x + \frac{1}{\theta} (x^+ - x) \end{align*}Here, we choose $\theta_k = 2/(k+2)$.
In fact, any $\theta_k>0$ satisfying the following relation can be used:
$$ \frac{1-\theta_{k+1}}{\theta_{k+1}^2} \le \frac{1}{\theta_k^2} $$If $\alpha_{k+1}\le \alpha_k$, then this implies that
$$ (\spadesuit) \qquad \frac{\alpha_{k+1}(1-\theta_{k+1})}{\theta_{k+1}^2} \le \frac{\alpha_k}{\theta_k^2} $$Since $\Psi$ is a convex function, we get
$$ \Psi(z) \ge \Psi(v) + g(v)^T (z-v), \;\; \forall v, z\in \R^n, \;\; g(v) \in \partial \Psi(v) $$For $v = x^+$ and $z = y - \alpha \nabla F(y)$ with $v = \prox_{\alpha \Psi}(z)$, we know:
$$ -(v-z) \in \alpha\, \partial \Psi(v) $$Together, we get
$$ \Psi(z) \ge \Psi(x^+) - \frac{1}{\alpha}(x^+ -z)^T (y - \alpha \nabla F(y)-x^+), \;\; \forall z $$Rearranging terms, we have
$$ (*) \qquad \Psi(x^+) \le \Psi(z) + \frac{1}{\alpha}(x^+ -z)^T (y - x^+) + \nabla F(y)^T(z - x^+), \;\; \forall z $$$\nabla F$ is $L$-Lipschitz continuous: with $\alpha \le 1/L$,
$$ (**) \qquad F(x^+) \le F(y) + \nabla F(y)^T(x^+ - y) + \frac{1}{2\alpha} \|x^+ - y\|^2 $$Summing up together with:
$$ (*) \qquad \Psi(x^+) \le \Psi(z) + \frac{1}{\alpha}(x^+ -z)^T (y - x^+) + \nabla F(y)^T(z - x^+) $$gives
\begin{align*} (***) \qquad f(x^+) &\le F(y) + \Psi(z) + \nabla F(y)^T (z - y) + \frac{1}{\alpha} (x^+ - y)^T(z-x^+) \\ &+ \frac{1}{2\alpha} \|x^+ - y\|^2, \;\; \forall z \end{align*}From the convexity of $F$, we have
$$ F(y) \le F(z) - \nabla F(y)^T(z-y), \;\; \forall z $$Then $(***)$ simplifies to:
$$ (\dagger) \qquad f(x^+) \le f(z) + \frac{1}{\alpha} (x^+ - y)^T(z-x^+) + \frac{1}{2\alpha} \|x^+ - y\|^2, \;\; \forall z $$Create two versions with $z=x$ and $z=x^*$, multiply both sides of each by $(1-\theta)$ and $\theta$:
\begin{align*} (1-\theta) \;&: \; f(x^+) \le f(x) + \frac{1}{\alpha} (x^+ - y)^T(x-x^+) + \frac{1}{2\alpha} \|x^+ - y\|^2 \\ \theta \; &: \;f(x^+) \le f(x^*) + \frac{1}{\alpha} (x^+ - y)^T(x^*-x^+) + \frac{1}{2\alpha} \|x^+ - y\|^2 \end{align*}Adding them together:
\begin{align*} (\dagger\dagger) \qquad & f(x^+) - f(x^*) - (1-\theta)(f(x) - f(x^*)) \\ &\quad \le \frac{1}{\alpha} (x^+ - y)^T (\theta x^* + (1-\theta)x - x^+) + \frac{1}{2\alpha} \|x^+ - y\|^2 \end{align*}Replace in:
\begin{align*} (\dagger\dagger) \qquad & f(x^+) - f(x^*) - (1-\theta)(f(x) - f(x^*)) \\ &\quad \le \frac{1}{\alpha} \underbrace{(x^+ - y)}_{\theta(u^+ - u)}^T (\theta x^* + \underbrace{(1-\theta)x - x^+}_{-\theta u^+}) + \frac{1}{2\alpha} \|\underbrace{x^+ - y}_{\theta(u^+-u)}\|^2 \end{align*}Finally, we get
$$ f(x^+) - f(x^*) - (1-\theta)(f(x) - f(x^*)) \le \frac{\theta^2}{2\alpha} (u^+ - u)^T(2x^* - u^+ -u) $$Notice that
\begin{align*} &f(x^+) - f(x^*) - (1-\theta)(f(x) - f(x^*)) \\ &\le \frac{\theta^2}{2\alpha} (u^+ - u)^T(2x^* - u^+ -u) = \frac{\theta^2}{2\alpha} \big\{ \|u-x^*\|^2 - \|u^+ -x^*\|^2 \big\} \end{align*}Rewriting with indices, this implies that:
\begin{align*} &\frac{\alpha_k}{\theta_k^2} (f(x_{k+1}) - f(x^*)) + \frac{1}{2} \|u_{k+1}-x^*\|^2 \\ &\;\; \le \frac{\alpha_k(1-\theta_k)}{\theta_k^2} (f(x_k) - f(x^*)) + \frac{1}{2} \|u_k-x^*\|^2 \end{align*}Chaining this inequality over $i=0,1,\dots,k-1$ (using $(\spadesuit)$ whenever needed):
$$ \begin{aligned} &\frac{\alpha_i}{\theta_i^2} (f(x_{i+1}) - f(x^*)) + \frac{1}{2} \|u_{i+1}-x^*\|^2 \\ &\;\; \le \frac{\alpha_i(1-\theta_i)}{\theta_i^2} (f(x_i) - f(x^*)) + \frac{1}{2} \|u_i-x^*\|^2 \end{aligned}, \quad (\spadesuit) \;\; \frac{\alpha_{k+1}(1-\theta_{k+1})}{\theta_{k+1}^2} \le \frac{\alpha_k}{\theta_k^2} $$$ \Rightarrow $
\begin{align*} &\frac{\alpha_{k-1}}{\theta_{k-1}^2} (f(x_k) - f(x^*)) + \frac{1}{2} \|u_k-x^*\|^2 \\ &\;\; \le \frac{\alpha_0(1-\theta_0)}{\theta_0^2} (f(x_0) - f(x^*)) + \frac{1}{2} \|u_0-x^*\|^2\\ &\;\; = \frac12 \|x_0 - x^*\|^2 \;\; (\text{when we choose } \theta_0 = 1, u_0 = x_0) \end{align*}
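A quick numerical check that the choice $\theta_0 = 1$, $\theta_k = 2/(k+2)$ indeed satisfies the relation $\frac{1-\theta_{k+1}}{\theta_{k+1}^2} \le \frac{1}{\theta_k^2}$ used in the chaining above (algebraically, it reduces to $(k+1)(k+3) \le (k+2)^2$):

```python
def theta(k):
    # theta_k = 2/(k+2); note theta(0) == 1
    return 2.0 / (k + 2.0)

for k in range(10000):
    lhs = (1.0 - theta(k + 1)) / theta(k + 1) ** 2   # equals (k+1)(k+3)/4
    rhs = 1.0 / theta(k) ** 2                        # equals (k+2)^2/4
    assert lhs <= rhs
print("relation holds for k = 0..9999")
```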
Finally, with $\alpha_k = \alpha$ and $\theta_k = 2/(k+2)$ (so that $\theta_{k-1} = 2/(k+1)$),
$$ f(x_k) - f(x^*) \le \frac{\theta_{k-1}^2}{2\alpha_{k-1}} \|x_0-x^*\|^2 = \frac{2\|x_0 - x^*\|^2}{\alpha (k+1)^2} $$For $k=0,1,2,\dots$:
\begin{align*} y_k &= (1-\theta_k) x_k + \theta_k u_k \\ x_{k+1} &= \text{prox}_{\alpha_k \Psi} ( y_k - \alpha_k \nabla F(y_k)) \\ u_{k+1} &= x_k + \frac{1}{\theta_k} (x_{k+1} - x_k) \end{align*}Required Conditions:
Regarding the stepsize, AGD requires two conditions: the descent condition $(**)$ (guaranteed when $\alpha_k \le 1/L$) and nonincreasing stepsizes $\alpha_{k+1} \le \alpha_k$ (needed for $(\spadesuit)$). When $L$ is unknown, both can be maintained by a backtracking linesearch:
Backtracking Linesearch: given $\beta \in (0,1)$,
$\alpha_k = \alpha_{k-1}$
$x_{k+1} = \prox_{\alpha_k \Psi}(y_k - \alpha_k \nabla F(y_k))$
While( $F(x_{k+1}) > F(y_k) + \nabla F(y_k)^T (x_{k+1} - y_k) + \frac{1}{2\alpha_k} \|x_{k+1} - y_k\|^2$ )
$\alpha_k = \beta \alpha_k$
$x_{k+1} = \prox_{\alpha_k \Psi}(y_k - \alpha_k \nabla F(y_k))$
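A sketch of one such backtracking step in NumPy (hypothetical helper names; $F(x)=\frac12\|Ax-b\|^2$ and $\Psi = \lambda\|\cdot\|_1$ as a running lasso example):

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def backtracking_prox_step(F, grad_F, y, alpha_prev, lam, beta=0.5):
    # Start from the previous stepsize (so alpha_k <= alpha_{k-1}) and shrink
    # until F(x+) <= F(y) + grad^T (x+ - y) + ||x+ - y||^2 / (2*alpha) holds.
    alpha, g = alpha_prev, grad_F(y)
    while True:
        x_next = soft_threshold(y - alpha * g, alpha * lam)
        d = x_next - y
        if F(x_next) <= F(y) + g @ d + (d @ d) / (2.0 * alpha):
            return x_next, alpha
        alpha *= beta                   # backtrack: alpha_k = beta * alpha_k

# Tiny usage example on random data
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_F = lambda x: A.T @ (A @ x - b)
x1, a1 = backtracking_prox_step(F, grad_F, np.zeros(5), 10.0, lam=0.1)
```

The loop terminates because the condition is guaranteed once $\alpha_k \le 1/L$.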
Any property of PGD that is missing in AGD? (Monotonicity: unlike PGD, AGD does not guarantee $f(x_{k+1}) \le f(x_k)$.)
include("lasso.ISTA.jl")
include("lasso.FISTA.jl")
using Distributions
using PyPlot
m = 100
n = 100
lam = .001
X = rand(Normal(0, 2), (m,n))
wstar = rand(Normal(1.2, 2.0), (n,1))
y = X * wstar;
maxiter = 1000;
ista = lassoISTA(X, y, step="const", lam=lam, maxit=maxiter, repf=true, repg=true);
simple = lassoFISTA(X, y, step="simple", lam=lam, maxit=maxiter, repf=true, repg=true);
fista = lassoFISTA(X, y, step="const", lam=lam, maxit=maxiter, repf=true, repg=true);
subplot(121)
PyPlot.semilogy(1:maxiter, ista[2], color="blue");
PyPlot.semilogy(1:maxiter, simple[2], color="green");
PyPlot.semilogy(1:maxiter, fista[2], color="red", linestyle="dashed");
xlabel("Iterations")
ylabel("Objective function value")
ax = gca()
ax[:set_ylim]([0, 100])
ax[:legend](("PGD", "AGD (simple)", "AGD (fista)"), loc=1)
subplot(122)
PyPlot.semilogy(1:maxiter, ista[3], color="blue");
PyPlot.semilogy(1:maxiter, simple[3], color="green");
PyPlot.semilogy(1:maxiter, fista[3], color="red", linestyle="dashed");
xlabel("Iterations")
ylabel("Gradient inf norm")
ax = gca()
ax[:set_ylim]([0, 10])
ax[:legend](("PGD", "AGD (simple)", "AGD (fista)"), loc=1)
L = 1421.0872221442771