Large-Scale Optimization








$\newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \newcommand{\prox}{\text{prox}} $

Ch11. Proximal Gradient Descent

TU Dortmund University, Dr. Sangkyun Lee

Proximal Gradient Descent

Proximal gradient descent is a recent optimization technique based on:

  • Decomposition
  • Proximal operator

PGD is also known as

  • Generalized gradient descent
  • Forward-backward splitting
  • ISTA (Iterative Shrinkage-Thresholding Algorithm)

Recall: Gradient Descent in the Quadratic Approximation View

In [4]:
%%tikz --scale 1 --size 1600,1600 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {1 + (\y-1) + (\y-1)*(\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (.5,.75) -- (.5,0) node[below] {\tiny $x_{k+1}$};
\fill (1,1) circle [radius=1pt];
\fill (.5,.75) circle [radius=1pt];
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(x_k) + \nabla f(x_k)^T(x-x_k) + \frac{1}{2\alpha_k}\|x-x_k\|^2$};

Recall: Subgradient Method for Nonsmooth Minimization

$$ \min_{x\in\R^n} \;\; f(x), \quad f: \text{ nonsmooth} $$

Update: $$ x_{k+1} = x_k - \alpha_k g_k, \;\; g_k \in \partial f(x_k) $$

Keep $x_\text{best}$ ($f(x_{k+1}) \le f(x_k)$ is not guaranteed)

Convergence:

  • Constant stepsize: $\alpha_k = \frac{D}{M\sqrt{K}}$ $\Rightarrow$ $f(x_{K, \text{best}}) - f(x^*) \le \frac{DM}{2\sqrt{K}}$
  • Diminishing stepsize: $\alpha_k = \frac{D}{M\sqrt{k}}$ $\Rightarrow$ $f(x_{k, \text{best}}) - f(x^*) \le \mathcal O\left( \frac{DM}{2}\frac{\ln k}{\sqrt{k}} \right)$

Decomposition

$$ \min_{x\in\R^n} \;\; f(x) = F(x) + \Psi(x) $$

Suppose that the objective $f$ can be decomposed into two parts:

  • $F$ convex and smooth
  • $\Psi$ convex, nonsmooth but "simple"
  • $f$ is often called a "composite" objective

Ex. Convex regularization problems

  • LASSO: $f(w) = \frac12 \|y-Xw\|^2 + \lambda \|w\|_1$ (a code sketch of this split follows below)
  • Matrix completion: $f(X) = \frac12 \sum_{(i,j)\in \Omega} (A_{ij} - X_{ij})^2 + \lambda \|X\|_*$
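To make the decomposition concrete, here is a minimal Julia sketch of the LASSO split into $F$ and $\Psi$; the toy data, the function names, and the use of the standard LinearAlgebra library are illustrative assumptions, not part of the lecture code:

using LinearAlgebra

# Toy LASSO instance (illustrative data only)
X   = randn(50, 20)     # design matrix
y   = randn(50)         # responses
lam = 0.1               # regularization parameter λ

F(w)     = 0.5 * norm(y - X * w)^2   # smooth, convex part
Psi(w)   = lam * norm(w, 1)          # convex, nonsmooth but "simple" part
f(w)     = F(w) + Psi(w)             # composite objective
gradF(w) = X' * (X * w - y)          # gradient of the smooth part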

Decomposition

$$ \min_{x\in\R^n} \;\; f(x) = F(x) + \Psi(x), \;\; \text{$F$ smooth, $\Psi$ nonsmooth} $$

After all, $f$ is nonsmooth

  • Subgradient method gives a very slow convergence rate $O(1/\sqrt{k})$
  • However, we have a smooth part $F$: can we use this fact?

If $\Psi(x) = 0$, we can apply GD, whose update is:

$ x_{k+1} = \argmin_{x} \; f(x_k) + \nabla f(x_k)^T (x-x_k) + \frac{1}{2\alpha_k} \|x-x_k\|^2 $

We consider a natural extension with a nontrivial $\Psi$:

$ x_{k+1} = \argmin_{x} \; F(x_k) + \nabla F(x_k)^T (x-x_k) + \frac{1}{2\alpha_k} \|x-x_k\|^2 + \Psi(x) $

Reformulation

$ x_{k+1} = \argmin_{x} \; F(x_k) + \nabla F(x_k)^T (x-x_k) + \frac{1}{2\alpha_k} \|x-x_k\|^2 + \Psi(x) $

Focus on: $ F(x_k) + \nabla F(x_k)^T (x-x_k) + \frac{1}{2\alpha_k} \|x-x_k\|^2 $

$\rightarrow \frac{1}{2\alpha_k} \left( \|x - x_k\|^2 + 2 \alpha_k\nabla F(x_k)^T x + \text{const} \right)$

$\rightarrow \frac{1}{2\alpha_k} \left( x^T x -2 (x_k - \alpha_k \nabla F(x_k))^T x + \text{const} \right)$

$\rightarrow \frac{1}{2\alpha_k} \| x - (x_k - \alpha_k \nabla F(x_k)) \|^2 + \text{const} $

Proximal Operator

Therefore, the generalized gradient descent updates can be written as $$ x_{k+1} = \argmin_{x} \; \left\{ \frac{1}{2\alpha_k}\| x - (x_k -\alpha_k \nabla F(x_k)) \|^2 + \Psi(x) \right\} $$

In general, we define the proximal operator

$$ \prox_{h}(z) := \argmin_{x\in\R^n} \; \left\{ \frac{1}{2} \|x - z\|^2 + h(x) \right\} $$

Then the update rule becomes simply, $$ x_{k+1} = \prox_{\alpha_k \Psi} ( x_k - \alpha_k \nabla F(x_k)) $$
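As a sanity check on the update rule, here is a minimal generic proximal gradient loop in Julia; it is a sketch that assumes the user supplies gradF(x) (returning $\nabla F(x)$) and prox(z, t) (returning $\prox_{t\Psi}(z)$), and the constant stepsize and iteration count are arbitrary placeholders:

# Generic proximal gradient descent (sketch; gradF and prox assumed given)
function proximal_gradient(gradF, prox, x0; alpha = 1e-2, maxit = 1000)
    x = copy(x0)
    for k in 1:maxit
        # x_{k+1} = prox_{αΨ}(x_k - α ∇F(x_k))
        x = prox(x - alpha * gradF(x), alpha)
    end
    return x
end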

Proximal Operator

$$ \prox_{h}(z) := \argmin_{x\in\R^n} \; \left\{ \frac{1}{2} \|x - z\|^2 + h(x) \right\} $$

Consider $h(x) = I_C(x) = \begin{cases} 0 & \text{if } x \in C \\ +\infty & \text{otherwise} \end{cases}$ for a convex set $C$. What is the prox equivalent to?

For more properties: Proximal Algorithms, Parikh and Boyd, Foundations and Trends in Optimization, 2013

https://web.stanford.edu/~boyd/papers/pdf/prox_algs.pdf

Generalized GD

$$ x_{k+1} = \text{prox}_{\alpha_k \Psi} ( x_k - \alpha_k \nabla F(x_k)) $$

We can rewrite the expression as follows:

$$ x_{k+1} = x_k - \alpha_k \frac{x_k - \text{prox}_{\alpha_k \Psi} ( x_k - \alpha_k \nabla F(x_k))}{\alpha_k} $$

And therefore, $$ x_{k+1} = x_k - \alpha_k G_{\alpha_k}^\Psi(x_k) $$

$G_{\alpha_k}^\Psi(x_k)$ can be seen as a generalized gradient
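In code, the generalized gradient is just the scaled difference between $x_k$ and its proximal update; a one-line sketch, again assuming gradF and prox as above:

# Generalized gradient map G_α^Ψ(x) (sketch; gradF and prox assumed given)
G(x, alpha) = (x - prox(x - alpha * gradF(x), alpha)) / alpha
# so the update reads like plain gradient descent: x = x - alpha * G(x, alpha)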

Forward-Backward Splitting

$$ x_{k+1} = \argmin_{x} \; \left\{ \frac{1}{2\alpha_k}\| x - (x_k -\alpha_k \nabla F(x_k)) \|^2 + \Psi(x) \right\} $$

This can be seen as a two-step procedure:

\begin{align*} x_{k+\frac{1}{2}} & = x_k -\alpha_k \nabla F(x_k)\\ x_{k+1} & = \argmin_{x} \; \left\{ \frac{1}{2\alpha_{k+\frac{1}{2}}}\| x - x_{k+\frac{1}{2}} \|^2 + \Psi(x) \right\} \end{align*}

Therefore, for $g_\Psi(x_{k+1}) \in \partial \Psi(x_{k+1})$, $$ x_{k+1} = x_k - \underbrace{\alpha_k \nabla F(x_k)}_{\text{forward (gradient) step}} - \underbrace{\alpha_{k+\frac{1}{2}}g_\Psi(x_{k+1})}_{\text{backward (proximal) step}} $$

Simple Nonsmooth Part

Each iteration needs to compute the prox operation:

$$ \text{prox}_{\Psi}(z) := \argmin_{x\in\R^n} \; \left\{ \frac{1}{2} \|x - z\|^2 + \Psi(x) \right\} $$

We call $\Psi$ "simple", if its prox op can be solved efficiently

Many interesting regularizers $\Psi$ have this property

Ex. $\Psi(x) = \lambda \|x\|_1$

$ \text{prox}_{\alpha_k \lambda \|x\|_1}(z) := \argmin_{x\in\R^n} \; \left\{ \frac{1}{2} \|x - z\|^2 + \alpha_k \lambda \|x\|_1 \right\} $

$0 \in x^*-z + \alpha_k \lambda g, \;\; g \in \partial \|x^*\|_1$

$x_i^* = z_i - \alpha_k \lambda g_i, \;\; g_i = \begin{cases} +1 & x_i^* > 0 \\ -1 & x_i^* < 0 \\ [-1,1] & x_i^*=0 \end{cases}$

$[\text{prox}_{\alpha_k \lambda \|x\|_1}(z)]_i = \begin{cases} z_i - \alpha_k \lambda & z_i > \alpha_k \lambda \\ z_i + \alpha_k \lambda & z_i < -\alpha_k \lambda\\ 0 & z_i \in \alpha_k\lambda [-1,1] \end{cases}$

This is the so-called soft-thresholding operator
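In Julia, this elementwise soft-thresholding operator is a one-liner; a sketch with our own (hypothetical) function name:

# prox of t‖·‖₁: elementwise soft-thresholding
soft_threshold(z, t) = sign.(z) .* max.(abs.(z) .- t, 0)

# One ISTA/PGD step for the LASSO would then be (alpha, lam, gradF assumed given):
#   x = soft_threshold(x - alpha * gradF(x), alpha * lam)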

Proximal GD: Convergence

$$ \min_{x\in\R^n} \;\; f(x) = F(x) + \Psi(x) $$

$F$ is $L$-strongly smooth, i.e.,

$$ \| \nabla F(x) - \nabla F(y) \|_2 \le L \|x - y\|_2, \;\; \forall x, y \in\R^n $$

Then we can show for PGD using a constant stepsize $\alpha \le \frac{1}{L}$,

$$ f(x_{k}) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\alpha k} $$

$\nabla F$ is $L$-Lipschitz continuous,

$\Rightarrow f(y) \le F(x) + \nabla F(x)^T (y-x) + \frac{L}{2}\|y-x\|^2 + \Psi(y), \;\; \forall x, y$

Using the update rule: $x_{k+1} = x_k - \alpha_k G_{\alpha_k}^\Psi(x_k)$,

$f(x_{k+1}) \le F(x_k) -\alpha_k \nabla F(x_k)^T G_{\alpha_k}^\Psi(x_k) + \alpha_k^2 \frac{L}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2 + \Psi(x_{k+1}) \qquad (*)$

On the other hand, (recall $u = \text{prox}_h(z) \;\; \Leftrightarrow \;\; z - u \in \partial h(u)$)

$x_{k+1} = x_k - \alpha_k G_{\alpha_k}^\Psi(x_k) = \text{prox}_{\alpha_k \Psi}(x_k - \alpha_k \nabla F(x_k))$

$\Rightarrow G_{\alpha_k}^\Psi(x_k) - \nabla F(x_k) \in \partial \Psi(x_k - \alpha_k G_{\alpha_k}^\Psi(x_k)) = \partial \Psi(x_{k+1})$

From the definition of a subgradient,

$ \Psi(x^*) \ge \Psi(x_{k+1}) + (G_{\alpha_k}^\Psi(x_k) - \nabla F(x_k))^T(x^* - x_{k+1}) \qquad (**) $

$f(x_{k+1}) \le F(x_k) -\alpha_k \nabla F(x_k)^T G_{\alpha_k}^\Psi(x_k) + \alpha_k^2 \frac{L}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2 + \Psi(x_{k+1}) \qquad (*)$

$\Psi(x^*) \ge \Psi(x_{k+1}) + (G_{\alpha_k}^\Psi(x_k) - \nabla F(x_k))^T(x^* - x_{k+1}) \qquad (**) \Rightarrow (*)$

$F(x^*) \ge F(x_k) + \nabla F(x_k)^T (x^*-x_k) \qquad (***) \Rightarrow (*)$

\begin{align*} f(x_{k+1}) &\le f(x^*) - \nabla F(x_k)^T (x^*-x_k)\\ & -(G_{\alpha_k}^\Psi (x_k)-\nabla F(x_k))^T(x^* - x_k + \alpha_k G_{\alpha_k}^\Psi (x_k)) \\ &- \alpha_k \nabla F(x_k)^T G_{\alpha_k}^\Psi(x_k) + \alpha_k^2 \frac{L}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2 \end{align*}

$f(x_{k+1}) \le f(x^*) + G_{\alpha_k}^\Psi (x_k)^T (x_k - x^*) + \alpha_k \left( \frac{\alpha_kL}{2} - 1\right) \|G_{\alpha_k}^\Psi (x_k)\|^2 $


Choosing $\alpha_k = \alpha \le \frac{1}{L}$,

$$f(x_{k+1}) \le f(x^*) + G_{\alpha_k}^\Psi (x_k)^T (x_k - x^*) -\frac{\alpha_k}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2$$

Using the fact that $G_{\alpha_k}^\Psi (x_k) = \frac{1}{\alpha_k} (x_k - x_{k+1})$,

$$f(x_{k+1}) \le f(x^*) + \frac{1}{2\alpha_k} \left( \|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2 \right)$$

Therefore,

$$ \sum_{i=1}^{k} (f(x_{i}) - f(x^*)) \le \frac{1}{2\alpha} \left( \|x_0 - x^*\|^2 - \|x_{k} - x^*\|^2 \right) \le \frac{1}{2\alpha} \|x_0 - x^*\|^2 $$

If $f(x_k)$ is nonincreasing (shown below under "Monotonicity"), then

$$ k (f(x_k) - f(x^*)) \le \sum_{i=1}^{k} (f(x_{i}) - f(x^*)) \le \frac{1}{2\alpha} \|x_0 - x^*\|^2 $$

Therefore,

$$ f(x_{k}) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\alpha k} $$

Monotonicity

Repeat the analysis with

$f(x_{k+1}) \le F(x_k) -\alpha_k \nabla F(x_k)^T G_{\alpha_k}^\Psi(x_k) + \alpha_k^2 \frac{L}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2 + \Psi(x_{k+1}) \qquad (*)$

$\Psi(x_k) \ge \Psi(x_{k+1}) + (G_{\alpha_k}^\Psi(x_k) - \nabla F(x_k))^T(x_k - x_{k+1}) \qquad (**) \Rightarrow (*)$

\begin{align*} f(x_{k+1}) &\le f(x_k) -(G_{\alpha_k}^\Psi (x_k)-\nabla F(x_k))^T(\alpha_k G_{\alpha_k}^\Psi (x_k)) \\ &- \alpha_k \nabla F(x_k)^T G_{\alpha_k}^\Psi(x_k) + \alpha_k^2 \frac{L}{2} \|G_{\alpha_k}^\Psi (x_k)\|^2 \end{align*}

To obtain:

\begin{align*} f(x_{k+1}) & \le f(x_k) + \alpha_k \left( \frac{\alpha_kL}{2} - 1\right) \|G_{\alpha_k}^\Psi (x_k)\|^2\\ &\le f(x_k) \qquad \text{if $\alpha_k \le \frac{1}{L}$} \end{align*}

Convergence with Backtracking LS

Let $\beta \in (0,1)$.

In each iteration $k$, search for a stepsize $\alpha$ (starting from $\alpha=1$) satisfying:

$$ f(x_k - \alpha G_{\alpha}^\Psi(x_k)) \le f(x_k) - \frac{\alpha}{2} \|G_\alpha^\Psi(x_k)\|^2, $$

shrinking $\alpha \leftarrow \beta\alpha$ until the condition holds.

Then we have

$$ f(x_{k}) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\alpha_{\min} k}, \qquad \alpha_{\min} = \min\{1, \beta / L\} $$
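A minimal Julia sketch of one PGD step with this backtracking rule, assuming f, gradF, and prox(z, t) are available (all names are placeholders):

# One proximal gradient step with backtracking line search (sketch)
function pgd_step_backtracking(f, gradF, prox, x; beta = 0.5)
    alpha = 1.0
    g = gradF(x)
    while true
        xnew = prox(x - alpha * g, alpha)     # candidate update
        G = (x - xnew) / alpha                # generalized gradient G_α^Ψ(x)
        # sufficient decrease: f(x - αG) ≤ f(x) - (α/2)‖G‖²
        if f(xnew) <= f(x) - (alpha / 2) * sum(abs2, G)
            return xnew, alpha
        end
        alpha *= beta                         # shrink the stepsize and retry
    end
end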

Acceleration

A technique to achieve the optimal rate $O\left(\frac{1}{k^2}\right)$ among first-order methods

Ideas by Yurii Nesterov

  • 1983: A method for unconstrained convex minimization with the rate of convergence O(1/k^2), Doklady AN SSSR 269
  • 1988: On an approach to the construction of optimal methods of minimization of smooth convex functions, Èkonom. i. Mat. Metody 24
  • 2005: Smooth minimization of nonsmooth functions, Math. Program 103
  • 2007: Gradient methods for minimizing composite objective function, CORE report (Math. Program, 2013)

FISTA (Fast ISTA) [Beck & Teboulle, SIAM J. Imaging Sci., 2009]

Extension of Nesterov's first (1983) idea to decomposable objective functions

$y_1 = x_0 \in \R^n$, $t_1 = 1$

For $k=1,2,\dots$:

\begin{align*} x_k &= \prox_{\alpha_k\Psi} (y_k - \alpha_k\nabla F(y_k))\\ t_{k+1} &= \frac{1+\sqrt{1+4t_k^2}}{2}\\ y_{k+1} &= x_k + \left( \frac{t_k-1}{t_{k+1}} \right) (x_k - x_{k-1}) \end{align*}

We choose $\alpha_k = 1/L$, as in proximal GD.
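The loop above translates almost line by line into Julia; the following sketch assumes gradF, prox(z, t), and the Lipschitz constant L are given (placeholder names, not the lecture's lassoFISTA code):

# FISTA with constant stepsize α = 1/L (sketch)
function fista(gradF, prox, x0, L; maxit = 1000)
    alpha = 1 / L
    x = copy(x0); xprev = copy(x0)
    y = copy(x0); t = 1.0
    for k in 1:maxit
        xprev = x
        x = prox(y - alpha * gradF(y), alpha)      # proximal gradient step at y_k
        tnext = (1 + sqrt(1 + 4 * t^2)) / 2        # momentum parameter t_{k+1}
        y = x + ((t - 1) / tnext) * (x - xprev)    # extrapolation (momentum) step
        t = tnext
    end
    return x
end

The simpler version below just replaces the coefficient $(t_k-1)/t_{k+1}$ by $(k-1)/(k+2)$.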

A Simpler Version

\begin{align*} y_k &= x_k + \frac{k-1}{k+2} (x_k - x_{k-1})\\ x_{k+1} &= \prox_{\alpha_k \Psi} (y_k - \alpha_k \nabla F(y_k)) \end{align*}
In [5]:
%%tikz --scale 1 --size 1100,1100 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {1 + (\y-1) + (\y-1)*(\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (.5,.75) -- (.5,0) node[below] {\tiny $x_{k+1}$};
\fill (1,1) circle [radius=1pt];
\fill (.5,.75) circle [radius=1pt];
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(x_k) + \nabla f(x_k)^T(x-x_k) + \frac{1}{2\alpha_k}\|x-x_k\|^2$};
In [6]:
%%tikz --scale 1 --size 1100,1100 -f png
\newcommand{\tikzcircle}[2][red,fill=red]{\tikz[baseline=-0.5ex]\draw[#1,radius=#2] (0,0) circle ;}%

\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {.5+.5*.7^2 + .7*(\y-.7) + (\y-.7)*(\y-.7)});
\draw[thick, ->] (1.5, 0) -- (.7,0); 
\fill (1.5,.5+.5*1.5^2) circle [radius=1pt]; \draw[dotted] (1.5, .5+.5*1.5^2) -- (1.5, 0) node[below] {\tiny $x_{k-1}$};
\fill (1,1) circle [radius=1pt]; \draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw [red,fill=red] (.7,.5+.5*.7^2) circle [radius=1pt]; \draw[dotted] (.7,.5+.5*.7^2) -- (.7,0) node[above, red] {\tiny $y_{k}$};
\fill (.35, .5+.35^2) circle [radius=1pt]; \draw[dotted] (.35, .5+.35^2) -- (.35,0) node[below] {\tiny $x_{k+1}$};
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(y_k) + \nabla f(y_k)^T(x-y_k) + \frac{1}{2\alpha_k}\|x-y_k\|^2$};
\draw[dashed,->] (1.1, 0.1) -- (3,2) node[right] {\scriptsize $x_k + \tau_k (\theta_k) (x_k-x_{k-1})$};

AGD: Convergence Rate

AGD with the constant stepsize $\alpha_k = 1/L$ gives

$$ f(x_k) - f(x^*) \le \frac{2\|x_0 - x^*\|^2}{\alpha (k+1)^2} $$

This achieves the optimal rate for first-order methods

To get an $\epsilon$ suboptimal solution, we need $O(1/\sqrt{\epsilon})$ iterations!

AGD with LS: Convergence Rate

$$ f(x_k) - f(x^*) \le \frac{2\|x_0 - x^*\|^2}{\alpha_{\min} (k+1)^2}, \qquad \alpha_{\min} := \min\{1, \beta/L\} $$

PGD vs. AGD

Any property of PGD that is missing in AGD?

In [1]:
include("../_julia/julia/lasso.ISTA.jl")
include("../_julia/julia/lasso.FISTA.jl")

using Distributions
using PyPlot

m = 100       # number of samples
n = 100       # number of features

lam = .001    # regularization parameter λ

X = rand(Normal(0, 2), (m,n))           # random design matrix
wstar = rand(Normal(1.2, 2.0), (n,1))   # ground-truth weights
y = X * wstar;                          # noiseless responses

maxiter = 1000;

# run PGD (ISTA) and the two accelerated variants on the same LASSO instance
ista = lassoISTA(X, y, step="const", lam=lam, maxit=maxiter, repf=true, repg=true);
simple = lassoFISTA(X, y, step="simple", lam=lam, maxit=maxiter, repf=true, repg=true);
fista = lassoFISTA(X, y, step="const", lam=lam, maxit=maxiter, repf=true, repg=true);

subplot(121)
PyPlot.semilogy(1:maxiter, ista[2], color="blue");
PyPlot.semilogy(1:maxiter, simple[2], color="green");
PyPlot.semilogy(1:maxiter, fista[2], color="red", linestyle="dashed");
xlabel("Iterations")
ylabel("Objective function value")
ax = gca()
ax[:set_ylim]([0, 100])
ax[:legend](("PGD", "AGD (simple)", "AGD (fista)"), loc=1)

subplot(122)
PyPlot.semilogy(1:maxiter, ista[3], color="blue");
PyPlot.semilogy(1:maxiter, simple[3], color="green");
PyPlot.semilogy(1:maxiter, fista[3], color="red", linestyle="dashed");
xlabel("Iterations")
ylabel("Gradient inf norm")
ax = gca()
ax[:set_ylim]([0, 10])
ax[:legend](("PGD", "AGD (simple)", "AGD (fista)"), loc=1)
tight_layout()
L = 1481.5710554161046
L = 1481.5710554161046

Exercises

Ex 11.1 (Subgradient characterization) Suppose that $h:\R^n\to \R$ is a convex function. Show that the following is true:

$$ u = \text{prox}_h(z) \;\; \Leftrightarrow \;\; z - u \in \partial h(u), $$

and therefore $$ h(z) - h(u) \ge \|z-u\|_2^2 $$

Ex 11.2 (Optimality Condition) Suppose that we solve the following problem:

$$ \min_{x\in\R^n} \;\; f(x) := F(x) + \Psi(x) $$

where $F:\R^n\to\R$ and $\Psi:\R^n\to\R$ are proper convex functions, and $F$ is continuously differentiable. In this case, the first-order necessary and sufficient condition for convex optimization applies in a general form, i.e.,

$$ x^* \text{ is a global minimizer} \;\; \Leftrightarrow \;\; 0 \in \partial f(x^*) $$
  • How would you check this in your PGD algorithm? Explain it for the special case of $\Psi(x) = \lambda \|x\|_1$ (where $\lambda>0$ is given).
  • We can try SGD to solve this problem. How would you check the optimality in case of SGD?