Large-Scale Optimization








$\newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \newcommand{\prox}{\text{prox}} $

L7. AGD: Accelerated Gradient Descent

TU Dortmund University, Dr. Sangkyun Lee

Recall: Proximal Gradient Descent (aka GGD, ISTA, FOBOS,...)

$$ \min_{x\in\R^n} \;\; f(x) = F(x) + \Psi(x) $$
  • $F$ convex and $L$-strongly smooth
  • $\Psi$ convex, nonsmooth but "simple"

Update:

\begin{align*} x_{k+1} &= \argmin_{x} \left\{ F(x_k) + \nabla F(x_k)^T (x-x_k) + \frac{1}{2\alpha_k} \|x-x_k\|^2 + \Psi(x) \right\}\\ &= \text{prox}_{\alpha_k \Psi} ( x_k - \alpha_k \nabla F(x_k)) \end{align*}

Then with stepsize $\alpha_k = \alpha \le \frac{1}{L}$, we have monotone ($f(x_{k+1}) \le f(x_k)$) convergence with rate:

$$ f(x_{k}) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\alpha k} $$
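As a concrete instance, here is a minimal Julia sketch of the PGD update for the lasso, where $F(x) = \frac12\|Ax-b\|^2$ and $\Psi(x) = \lambda \|x\|_1$, so the prox step is componentwise soft-thresholding. This is an illustrative sketch under these assumptions only, not the course code (lasso.ISTA.jl) used in the experiment below; the helper name soft_threshold is made up here.

In [ ]:
using LinearAlgebra

# prox of t*||.||_1 is componentwise soft-thresholding
soft_threshold(z, t) = sign.(z) .* max.(abs.(z) .- t, 0.0)

# Proximal gradient descent (ISTA) for  min_x  0.5*||A*x - b||^2 + lam*||x||_1
function pgd_lasso(A, b, lam; maxit=1000)
    x = zeros(size(A, 2))
    alpha = 1 / opnorm(A)^2           # constant stepsize alpha = 1/L with L = ||A||_2^2
    for k in 1:maxit
        grad = A' * (A * x - b)       # gradient of F at x_k
        x = soft_threshold(x - alpha * grad, alpha * lam)   # x_{k+1} = prox_{alpha*Psi}(x_k - alpha*grad)
    end
    return x
end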

Acceleration

A technique to achieve the optimal rate $O\left(\frac{1}{k^2}\right)$ for first-order methods

Ideas by Yurii Nesterov

  • 1983: A method for unconstrained convex minimization with the rate of convergence O(1/k^2), Doklady AN SSSR 269
  • 1988: On an approach to the construction of optimal methods of minimization of smooth convex functions, Èkonom. i. Mat. Metody 24
  • 2005: Smooth minimization of nonsmooth functions, Math. Program 103
  • 2007: Gradient methods for minimizing composite objective function, CORE report (Math. Program, 2013)

FISTA (Fast ISTA) [Beck & Teboulle, SIAM J. Imaging Sci., 2009]

Extension of Nesterov's first (1983) idea to composite (decomposable) objective functions

$y_1 = x_0 \in \R^n$, $t_1 = 1$

For $k=1,2,\dots$:

\begin{align*} x_k &= \prox_{\alpha_k\Psi} (y_k - \alpha_k\nabla F(y_k))\\ t_{k+1} &= \frac{1+\sqrt{1+4t_k^2}}{2}\\ y_{k+1} &= x_k + \left( \frac{t_k-1}{t_{k+1}} \right) (x_k - x_{k-1}) \end{align*}

We choose $\alpha_k = 1/L$ as in the proximal GD.
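For comparison, a minimal Julia sketch of the FISTA loop above, again in the lasso setting ($F(x)=\frac12\|Ax-b\|^2$, $\Psi(x)=\lambda\|x\|_1$); this is an assumed illustrative setup, not the course's lasso.FISTA.jl.

In [ ]:
using LinearAlgebra

soft_threshold(z, t) = sign.(z) .* max.(abs.(z) .- t, 0.0)   # prox of t*||.||_1

function fista_lasso(A, b, lam; maxit=1000)
    alpha = 1 / opnorm(A)^2                   # alpha_k = 1/L with L = ||A||_2^2
    x = zeros(size(A, 2))                     # x_0
    y = copy(x)                               # y_1 = x_0
    t = 1.0                                   # t_1 = 1
    for k in 1:maxit
        grad = A' * (A * y - b)                                   # gradient of F at y_k
        x_new = soft_threshold(y - alpha * grad, alpha * lam)     # x_k
        t_new = (1 + sqrt(1 + 4t^2)) / 2                          # t_{k+1}
        y = x_new + ((t - 1) / t_new) * (x_new - x)               # y_{k+1}
        x, t = x_new, t_new
    end
    return x
end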

A Simpler Version

\begin{align*} y_k &= x_k + \frac{k-1}{k+2} (x_k - x_{k-1})\\ x_{k+1} &= \prox_{\alpha_k \Psi} (y_k - \alpha_k \nabla F(y_k)) \end{align*}
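The coefficient $\frac{k-1}{k+2}$ reflects the growth of $t_k$: the FISTA recursion gives $t_k \ge \frac{k+1}{2}$ (and $t_k \approx \frac{k+1}{2}$ in practice), so, as a heuristic approximation rather than an exact identity,

$$ \frac{t_k - 1}{t_{k+1}} \;\approx\; \frac{\frac{k+1}{2} - 1}{\frac{k+2}{2}} \;=\; \frac{k-1}{k+2} $$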

PGD vs. AGD

In [51]:
%%tikz --scale 1 --size 1100,1100 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {1 + (\y-1) + (\y-1)*(\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (.5,.75) -- (.5,0) node[below] {\tiny $x_{k+1}$};
\fill (1,1) circle [radius=1pt];
\fill (.5,.75) circle [radius=1pt];
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(x_k) + \nabla f(x_k)^T(x-x_k) + \frac{1}{2\alpha_k}\|x-x_k\|^2$};
In [59]:
%%tikz --scale 1 --size 1100,1100 -f png
\newcommand{\tikzcircle}[2][red,fill=red]{\tikz[baseline=-0.5ex]\draw[#1,radius=#2] (0,0) circle ;}%

\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {.5+.5*.7^2 + .7*(\y-.7) + (\y-.7)*(\y-.7)});
\draw[thick, ->] (1.5, 0) -- (.7,0); 
\fill (1.5,.5+.5*1.5^2) circle [radius=1pt]; \draw[dotted] (1.5, .5+.5*1.5^2) -- (1.5, 0) node[below] {\tiny $x_{k-1}$};
\fill (1,1) circle [radius=1pt]; \draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw [red,fill=red] (.7,.5+.5*.7^2) circle [radius=1pt]; \draw[dotted] (.7,.5+.5*.7^2) -- (.7,0) node[above, red] {\tiny $y_{k}$};
\fill (.35, .5+.35^2) circle [radius=1pt]; \draw[dotted] (.35, .5+.35^2) -- (.35,0) node[below] {\tiny $x_{k+1}$};
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(y_k) + \nabla f(y_k)^T(x-y_k) + \frac{1}{2\alpha_k}\|x-y_k\|^2$};
\draw[dashed,->] (1.1, 0.1) -- (3,2) node[right] {\scriptsize $x_k + \tau_k (\theta_k) (x_k-x_{k-1})$};

AGD: Convergence Rate

AGD with the constant stepsize $\alpha_k = 1/L$ gives

$$ f(x_k) - f(x^*) \le \frac{2\|x_0 - x^*\|^2}{\alpha (k+1)^2} $$

This achieves the optimal rate for first-order methods.

To get an $\epsilon$-suboptimal solution, we need only $O(1/\sqrt{\epsilon})$ iterations!
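To make this explicit, require the bound above to be at most $\epsilon$:

$$ \frac{2\|x_0 - x^*\|^2}{\alpha (k+1)^2} \le \epsilon \quad\Longleftrightarrow\quad k+1 \ge \sqrt{\frac{2\|x_0 - x^*\|^2}{\alpha\,\epsilon}} = O\!\left(\frac{1}{\sqrt{\epsilon}}\right), $$

whereas the $O(1/k)$ rate of PGD would require $O(1/\epsilon)$ iterations.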

Reformulation for Analysis

\begin{align*} y_k &= x_k + \frac{k-1}{k+2} (x_k - x_{k-1})\\ x_{k+1} &= \text{prox}_{\alpha_k \Psi} ( y_k - \alpha_k \nabla F(y_k)) \end{align*}

$\Rightarrow$

\begin{align*} y_k &= (1-\theta_k) x_k + \theta_k u_k \\ x_{k+1} &= \text{prox}_{\alpha_k \Psi} ( y_k - \alpha_k \nabla F(y_k)) \\ u_{k+1} &= x_k + \frac{1}{\theta_k} (x_{k+1} - x_k) \end{align*}

with $u_0=x_0$ and $\theta_k = 2/(k+2)$
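As a quick check that this matches the simpler version: substituting $u_k = x_{k-1} + \frac{1}{\theta_{k-1}}(x_k - x_{k-1})$ and $\theta_k = 2/(k+2)$,

\begin{align*} y_k &= (1-\theta_k) x_k + \theta_k \Big( x_{k-1} + \tfrac{1}{\theta_{k-1}} (x_k - x_{k-1}) \Big) = x_k + \Big( \tfrac{\theta_k}{\theta_{k-1}} - \theta_k \Big) (x_k - x_{k-1}) \\ &= x_k + \Big( \tfrac{k+1}{k+2} - \tfrac{2}{k+2} \Big) (x_k - x_{k-1}) = x_k + \tfrac{k-1}{k+2} (x_k - x_{k-1}) \end{align*}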

For convenience, we write:

\begin{align*} y_k &= (1-\theta_k) x_k + \theta_k u_k \\ x_{k+1} &= \text{prox}_{\alpha_k \Psi} ( y_k - \alpha_k \nabla F(y_k)) \\ u_{k+1} &= x_k + \frac{1}{\theta_k} (x_{k+1} - x_k) \end{align*}

$\Rightarrow$

\begin{align*} y &= (1-\theta) x + \theta u \\ x^+ &= \text{prox}_{\alpha \Psi} ( y - \alpha \nabla F(y)) \\ u^+ &= x + \frac{1}{\theta} (x^+ - x) \end{align*}

About $\theta_k$

Here, we choose $\theta_k = 2/(k+2)$.

In fact, any sequence $\theta_k>0$ satisfying the following relation can be used:

$$ \frac{1-\theta_{k+1}}{\theta_{k+1}^2} \le \frac{1}{\theta_k^2} $$

If $\alpha_{k+1}\le \alpha_k$, then this implies that

$$ (\spadesuit) \qquad \frac{\alpha_{k+1}(1-\theta_{k+1})}{\theta_{k+1}^2} \le \frac{\alpha_k}{\theta_k^2} $$
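As a quick check, $\theta_k = 2/(k+2)$ indeed satisfies the relation above:

$$ \frac{1-\theta_{k+1}}{\theta_{k+1}^2} = \frac{1 - \frac{2}{k+3}}{\frac{4}{(k+3)^2}} = \frac{(k+1)(k+3)}{4} \le \frac{(k+2)^2}{4} = \frac{1}{\theta_k^2}, $$

since $(k+1)(k+3) = (k+2)^2 - 1$.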

About $\Psi$

Since $\Psi$ is a convex function, we get

$$ \Psi(z) \ge \Psi(v) + g(v)^T (z-v), \;\; \forall v, z\in \R^n, \;\; g(v) \in \partial \Psi(v) $$

Since $x^+ = \prox_{\alpha \Psi}\big(y - \alpha \nabla F(y)\big)$, the optimality condition of the prox operator gives

$$ -\big(x^+ - (y - \alpha \nabla F(y))\big) \in \partial (\alpha \Psi)(x^+), \quad\text{i.e.}\quad g(x^+) := \frac{1}{\alpha}\big(y - \alpha \nabla F(y) - x^+\big) \in \partial \Psi(x^+) $$

Together, we get

$$ \Psi(z) \ge \Psi(x^+) - \frac{1}{\alpha}(x^+ -z)^T (y - \alpha \nabla F(y)-x^+), \;\; \forall z $$

Rearranging terms, we have

$$ (*) \qquad \Psi(x^+) \le \Psi(z) + \frac{1}{\alpha}(x^+ -z)^T (y - x^+) + \nabla F(y)^T(z - x^+), \;\; \forall z $$

About $\nabla F$

$\nabla F$ is $L$-Lipschitz continuous: with $\alpha \le 1/L$,

$$ (**) \qquad F(x^+) \le F(y) + \nabla F(y)^T(x^+ - y) + \frac{1}{2\alpha} \|x^+ - y\|^2 $$

Adding this to

$$ (*) \qquad \Psi(x^+) \le \Psi(z) + \frac{1}{\alpha}(x^+ -z)^T (y - x^+) + \nabla F(y)^T(z - x^+) $$

gives

\begin{align*} (***) \qquad f(x^+) &\le F(y) + \Psi(z) + \nabla F(y)^T (z - y) + \frac{1}{\alpha} (x^+ - y)^T(z-x^+) \\ &+ \frac{1}{2\alpha} \|x^+ - y\|^2, \;\; \forall z \end{align*}

From the convexity of $F$, we have

$$ F(y) \le F(z) - \nabla F(y)^T(z-y), \;\; \forall z $$

Then $(***)$ simplifies to:

$$ (\dagger) \qquad f(x^+) \le f(z) + \frac{1}{\alpha} (x^+ - y)^T(z-x^+) + \frac{1}{2\alpha} \|x^+ - y\|^2, \;\; \forall z $$

Instantiate $(\dagger)$ once with $z=x$ and once with $z=x^*$, and multiply the two inequalities by $(1-\theta)$ and $\theta$, respectively:

\begin{align*} (1-\theta) \;&: \; f(x^+) \le f(x) + \frac{1}{\alpha} (x^+ - y)^T(x-x^+) + \frac{1}{2\alpha} \|x^+ - y\|^2 \\ \theta \; &: \;f(x^+) \le f(x^*) + \frac{1}{\alpha} (x^+ - y)^T(x^*-x^+) + \frac{1}{2\alpha} \|x^+ - y\|^2 \end{align*}

Adding them together:

\begin{align*} (\dagger\dagger) \qquad & f(x^+) - f(x^*) - (1-\theta)(f(x) - f(x^*)) \\ &\quad \le \frac{1}{\alpha} (x^+ - y)^T (\theta x^* + (1-\theta)x - x^+) + \frac{1}{2\alpha} \|x^+ - y\|^2 \end{align*}

From the algorithm:

\begin{align*} u^+ &= x + \frac{1}{\theta}(x^+ - x) \;\; \Leftrightarrow \;\; x^+ = \theta u^+ + (1-\theta)x \;\;\Leftrightarrow \;\; -\theta u^+ = (1-\theta)x - x^+\\ y &= (1-\theta)x + \theta u \;\; \Leftrightarrow \;\; y = x^+ -\theta u^+ + \theta u \;\; \Leftrightarrow \;\; x^+ - y = \theta(u^+ - u) \end{align*}

Substituting these into $(\dagger\dagger)$:

\begin{align*} (\dagger\dagger) \qquad & f(x^+) - f(x^*) - (1-\theta)(f(x) - f(x^*)) \\ &\quad \le \frac{1}{\alpha} \underbrace{(x^+ - y)}_{\theta(u^+ - u)}^T (\theta x^* + \underbrace{(1-\theta)x - x^+}_{-\theta u^+}) + \frac{1}{2\alpha} \|\underbrace{x^+ - y}_{\theta(u^+-u)}\|^2 \end{align*}

Finally, we get

$$ f(x^+) - f(x^*) - (1-\theta)(f(x) - f(x^*)) \le \frac{\theta^2}{2\alpha} (u^+ - u)^T(2x^* - u^+ -u) $$

Notice that

\begin{align*} &f(x^+) - f(x^*) - (1-\theta)(f(x) - f(x^*)) \\ &\le \frac{\theta^2}{2\alpha} (u^+ - u)^T(2x^* - u^+ -u) = \frac{\theta^2}{2\alpha} \big\{ \|u-x^*\|^2 - \|u^+ -x^*\|^2 \big\} \end{align*}
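The last equality is a difference-of-squares expansion (a quick verification, not spelled out in the slides):

$$ \|u - x^*\|^2 - \|u^+ - x^*\|^2 = \big( (u - x^*) - (u^+ - x^*) \big)^T \big( (u - x^*) + (u^+ - x^*) \big) = (u^+ - u)^T (2x^* - u^+ - u) $$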

Rewriting with indices, this implies that:

\begin{align*} &\frac{\alpha_k}{\theta_k^2} (f(x_{k+1}) - f(x^*)) + \frac{1}{2} \|u_{k+1}-x^*\|^2 \\ &\;\; \le \frac{\alpha_k(1-\theta_k)}{\theta_k^2} (f(x_k) - f(x^*)) + \frac{1}{2} \|u_k-x^*\|^2 \end{align*}

Writing this inequality for each $i=0,1,\dots,k-1$ and chaining them (using $(\spadesuit)$ whenever needed):

$$ \begin{aligned} &\frac{\alpha_i}{\theta_i^2} (f(x_{i+1}) - f(x^*)) + \frac{1}{2} \|u_{i+1}-x^*\|^2 \\ &\;\; \le \frac{\alpha_i(1-\theta_i)}{\theta_i^2} (f(x_i) - f(x^*)) + \frac{1}{2} \|u_i-x^*\|^2 \end{aligned}, \quad (\spadesuit) \;\; \frac{\alpha_{k+1}(1-\theta_{k+1})}{\theta_{k+1}^2} \le \frac{\alpha_k}{\theta_k^2} $$

$ \Rightarrow $

\begin{align*} &\frac{\alpha_{k-1}}{\theta_{k-1}^2} (f(x_k) - f(x^*)) + \frac{1}{2} \|u_k-x^*\|^2 \\ &\;\; \le \frac{\alpha_0(1-\theta_0)}{\theta_0^2} (f(x_0) - f(x^*)) + \frac{1}{2} \|u_0-x^*\|^2\\ &\;\; = \frac12 \|x_0 - x^*\|^2 \;\; (\text{when we choose } \theta_0 = 1, u_0 = x_0) \end{align*}

Finally, with $\alpha_k = \alpha$ and $\theta_k = 2/(k+2)$ (so that $\theta_{k-1} = 2/(k+1)$),

$$ f(x_k) - f(x^*) \le \frac{\theta_{k-1}^2}{2\alpha_{k-1}} \|x_0-x^*\|^2 = \frac{2\|x_0 - x^*\|^2}{\alpha (k+1)^2} $$

AGD

For $k=0,1,2,\dots$:

\begin{align*} y_k &= (1-\theta_k) x_k + \theta_k u_k \\ x_{k+1} &= \text{prox}_{\alpha_k \Psi} ( y_k - \alpha_k \nabla F(y_k)) \\ u_{k+1} &= x_k + \frac{1}{\theta_k} (x_{k+1} - x_k) \end{align*}

Required Conditions:

  • $\alpha_{k+1} \le \alpha_k \le 1/L$
  • $\theta_0$ = 1
  • $\frac{1-\theta_{k+1}}{\theta_{k+1}^2} \le \frac{1}{\theta_k^2}$
  • $u_0 = x_0$
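Putting the update and the required conditions together, a minimal generic Julia sketch of AGD (illustrative only; the argument names gradF and prox and the fixed stepsize are assumptions, not the course code):

In [ ]:
# AGD in the (y, u, x) form with theta_k = 2/(k+2) (so theta_0 = 1) and u_0 = x_0.
# gradF(y) returns the gradient of F at y; prox(z, a) evaluates prox_{a*Psi}(z).
function agd(gradF, prox, x0, alpha; maxit=1000)
    x = copy(x0)
    u = copy(x0)                                   # u_0 = x_0
    for k in 0:maxit-1
        theta = 2 / (k + 2)                        # satisfies (1-theta_{k+1})/theta_{k+1}^2 <= 1/theta_k^2
        y = (1 - theta) * x + theta * u            # y_k
        x_new = prox(y - alpha * gradF(y), alpha)  # x_{k+1}
        u = x + (x_new - x) / theta                # u_{k+1}
        x = x_new
    end
    return x
end

# Example usage for the lasso with data (A, b) and alpha = 1/L (assumed setup):
#   gradF(w) = A' * (A*w - b);  prox(z, a) = sign.(z) .* max.(abs.(z) .- a*lam, 0.0)
#   w = agd(gradF, prox, zeros(size(A, 2)), 1/opnorm(A)^2)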

Linesearch

Regarding stepsize, AGD requires two conditions:

  • $F(x_{k+1}) \le F(y_k) + \nabla F(y_k)^T (x_{k+1} - y_k) + \frac{1}{2\alpha_k} \|x_{k+1} - y_k\|^2$
  • $\alpha_{k} \le \alpha_{k-1}$

Backtracking Linesearch: given $\beta \in (0,1)$,

$\alpha_k = \alpha_{k-1}$

$x_{k+1} = \prox_{\alpha_k \Psi}(y_k - \alpha_k \nabla F(y_k))$

While( $F(x_{k+1}) > F(y_k) + \nabla F(y_k)^T (x_{k+1} - y_k) + \frac{1}{2\alpha_k} \|x_{k+1} - y_k\|^2$ )

$\alpha_k = \beta \alpha_k$

$x_{k+1} = \prox_{\alpha_k \Psi}(y_k - \alpha_k \nabla F(y_k))$
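A minimal Julia sketch of one AGD iteration with this backtracking linesearch (illustrative only; the function arguments F, gradF, prox and the default $\beta = 0.5$ are assumptions):

In [ ]:
using LinearAlgebra

# One AGD iteration with backtracking: start from the previous stepsize
# (so that alpha_k <= alpha_{k-1}) and shrink by beta until the quadratic
# upper bound on F holds at the candidate point.
function agd_step_linesearch(F, gradF, prox, y, alpha_prev; beta=0.5)
    alpha = alpha_prev                       # alpha_k = alpha_{k-1}
    g = gradF(y)
    x_new = prox(y - alpha * g, alpha)
    while F(x_new) > F(y) + dot(g, x_new - y) + norm(x_new - y)^2 / (2 * alpha)
        alpha *= beta                        # alpha_k = beta * alpha_k
        x_new = prox(y - alpha * g, alpha)
    end
    return x_new, alpha
end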

AGD with LS: Convergence Rate

$$ f(x_k) - f(x^*) \le \frac{2\|x_0 - x^*\|^2}{\alpha_{\min} (k+1)^2}, \qquad \alpha_{\min} := \min\{1, \beta/L\} $$

PGD vs. AGD

Any property of PGD that is missing in AGD? (Hint: recall that PGD converges monotonically, $f(x_{k+1}) \le f(x_k)$.)

In [35]:
include("lasso.ISTA.jl")
include("lasso.FISTA.jl")

using Distributions
using PyPlot

m = 100
n = 100

lam = .001

X = rand(Normal(0, 2), (m,n))
wstar = rand(Normal(1.2, 2.0), (n,1))
y = X * wstar;

maxiter = 1000;

ista = lassoISTA(X, y, step="const", lam=lam, maxit=maxiter, repf=true, repg=true);
simple = lassoFISTA(X, y, step="simple", lam=lam, maxit=maxiter, repf=true, repg=true);
fista = lassoFISTA(X, y, step="const", lam=lam, maxit=maxiter, repf=true, repg=true);

subplot(121)
PyPlot.semilogy(1:maxiter, ista[2], color="blue");
PyPlot.semilogy(1:maxiter, simple[2], color="green");
PyPlot.semilogy(1:maxiter, fista[2], color="red", linestyle="dashed");
xlabel("Iterations")
ylabel("Objective function value")
ax = gca()
ax[:set_ylim]([0, 100])
ax[:legend](("PGD", "AGD (simple)", "AGD (fista)"), loc=1)

subplot(122)
PyPlot.semilogy(1:maxiter, ista[3], color="blue");
PyPlot.semilogy(1:maxiter, simple[3], color="green");
PyPlot.semilogy(1:maxiter, fista[3], color="red", linestyle="dashed");
xlabel("Iterations")
ylabel("Gradient inf norm")
ax = gca()
ax[:set_ylim]([0, 10])
ax[:legend](("PGD", "AGD (simple)", "AGD (fista)"), loc=1)
L = 1421.0872221442771
[Output figure: objective function value (left) and gradient infinity norm (right) over the iterations, comparing PGD, AGD (simple), and AGD (fista).]