Large-Scale Optimization

L3. Gradient Descent

TU Dortmund University, Dr. Sangkyun Lee

Unconstrained Optimization

$$ \min_{x \in R^n} \;\; f(x) $$

$f$ is a convex function

Method Choices:

  • $f$ is continuously differentiable (smooth) $\rightarrow$ gradient descent
  • $f$ is not differentiable (non-smooth) $\rightarrow$ subgradient methods (later in this semester)

Optimization Algorithms

Many optimization algorithms share the following structure:

For $k=1,2,\dots$

  • Choose a search direction $s_k$
  • Choose a step size $\alpha_k$
  • Take a step: $x_{k+1} = x_k + \alpha_k s_k$
  • Check for convergence (optimality)

In gradient descent, we use $s_k = -\nabla f(x_k)$ (the steepest descent direction)
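As a concrete sketch, the loop above specialized to gradient descent might look as follows in Julia (modern Julia 1.x; the function name `gradient_descent` and the default values are our own choices, and the fixed stepsize is only a placeholder until stepsize selection is discussed below):

using LinearAlgebra

# Generic gradient descent loop (a sketch): `grad` is a user-supplied callable;
# the stopping test uses the gradient norm (see "Convergence Check" below).
function gradient_descent(grad, x0; alpha = 1e-2, tol = 1e-6, maxiter = 1000)
    x = copy(x0)
    for k in 1:maxiter
        g = grad(x)                  # search direction s_k = -g
        norm(g) < tol && return x    # convergence: ||∇f(x_k)|| < ε
        x = x - alpha * g            # step: x_{k+1} = x_k + α_k s_k = x_k - α_k ∇f(x_k)
    end
    return x
end

The linesearch and stepsize rules discussed later would replace the fixed `alpha`.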

Linear Regression

Find a linear function $f_w(x) = x^T w$, parametrized by $w$, that minimizes the squared prediction error.

Dataset: $\{ (x_i, y_i) \}_{i=1}^m$, $x_i \in R^n$, $y_i \in R$

Data (design) matrix $X\in R^{m\times n}$: $$ y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}, \quad X = \begin{bmatrix} \longleftarrow & x_1^T & \longrightarrow \\ \longleftarrow & x_2^T & \longrightarrow \\ \vdots & \vdots & \vdots \\ \longleftarrow & x_m^T & \longrightarrow \end{bmatrix}, \quad w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} $$

Objective: $$ L(w) = \frac12 \sum_{i=1}^m (y_i - x_i^Tw)^2 = \frac12 \|y - Xw\|^2 $$

Linear Regression

Objective: $$ L(w) = \frac12 \|y - Xw\|^2 $$

Gradient: $$ \nabla L(w) = -X^T (y-Xw) \in R^n $$
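A minimal sketch of these two formulas in Julia (modern 1.x; the names `lsq_obj` and `lsq_grad` and the toy data are ours), with a quick finite-difference check of the gradient:

using LinearAlgebra

lsq_obj(w, X, y)  = 0.5 * norm(y - X * w)^2    # L(w) = ½‖y − Xw‖²
lsq_grad(w, X, y) = -X' * (y - X * w)          # ∇L(w) = −Xᵀ(y − Xw)

# finite-difference check of the first gradient component on toy data
X = randn(20, 4); y = randn(20); w = randn(4)
h = 1e-6; e1 = [1.0, 0.0, 0.0, 0.0]
println((lsq_obj(w + h * e1, X, y) - lsq_obj(w, X, y)) / h, "  vs  ", lsq_grad(w, X, y)[1])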

Convergence Check

When $f$ is convex, then $x^*$ is a minimizer of $$ \min_{x\in R^n} \;\; f(x) $$ if and only if (iff) $$ \nabla f(x^*) = 0. $$ A natural stopping criterion is therefore $$ \| \nabla f(x_k) \| < \epsilon $$ for some small $\epsilon >0$.

Considerations for Large $n$ or $m$

Scaling: how does $L(w) = \frac12 \|y-Xw\|^2$ scale with large $m$?

Stopping criterion: $\|\nabla f(x_k)\| < \epsilon$. Which norm will be better?

Gradient Descent: Steepest Descent View

In [23]:
%%tikz --scale 1 --size 1600,1600 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=0:2.2, thick, variable=\y, red] plot({\y}, {1 + (\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (1.8,.5+.5*1.8^2) -- (1.8,0) node[below] {\tiny $x_k + s_k$};
\fill (1,1) circle [radius=1pt];
\fill (1.8,.5+.5*1.8^2) circle [radius=1pt];
\fill (1.8,1.8) circle [radius=1pt];
\draw[dashed,->] (1.8,.5+.5*1.8^2) -- (3.5, 1+.5*1.8^2) node[right] {$f(x_k + s_k)$};
\draw[dashed,->] (1.8,1.8) -- (7, .7+.5*1.8^2) node[above] {$\approx f(x_k) + \nabla f(x_k)^T s_k$};

We want: $~~f(x_k+s_k) \le f(x_k)$

\begin{align} s_k &= \arg\min_{s\in R^n} \nabla f(x_k)^T s\\ &\qquad \text{s.t.}\; \|s\|=\|\nabla f(x_k)\| \\ &= -\nabla f(x_k) \end{align}
$$ x_{k+1} = x_k + \alpha_k s_k \;\; \Rightarrow \;\; x_{k+1} = x_k - \alpha_k \nabla f(x_k) $$

Gradient Descent: Quadratic Approximation View

In [17]:
%%tikz --scale 1 --size 1600,1600 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {1 + (\y-1) + (\y-1)*(\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (.5,.75) -- (.5,0) node[below] {\tiny $x_{k+1}$};
\fill (1,1) circle [radius=1pt];
\fill (.5,.75) circle [radius=1pt];
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(x_k) + \nabla f(x_k)^T(x-x_k) + \frac{1}{2\alpha_k}\|x-x_k\|^2$};

Take the minimizer of RHS as the next iterate $x_{k+1}$:

$$ \nabla f(x_k) + (x_{k+1}-x_k)/\alpha_k = 0 \;\; \Rightarrow \;\; x_{k+1} = x_k - \alpha_k \nabla f(x_k) $$

Stepsize

Convergence of GD depends on how we choose stepsizes:

  • Exact linesearch
  • Inexact linesearch
  • Fixed stepsize
  • Decreasing stepsize

The best choice depends on the properties of the objective function:

  • Lipschitz continuity of gradients
  • Strong convexity

Fixed Stepsize

Take $\alpha_k = c$ for a constant $c>0$: $x_{k+1} = x_k + c\, s_k = x_k - c\, \nabla f(x_k)$.

  • What if $c$ is too small?
  • What if $c$ is too large?
  • Not easy to determine the "right" value: in many cases, the choice also depends on the data matrix.

Surprisingly many ML papers do something like:

We designed a gradient descent algorithm with a fixed stepsize 0.1, which worked fine for the five datasets we tried.

What can be problematic?
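To see what can go wrong, consider the 1-D toy quadratic $f(x) = \frac{L}{2}x^2$ (our own example): the fixed-stepsize update is $x_{k+1} = (1-cL)x_k$, so the iterates diverge whenever $c > 2/L$ and crawl when $c \ll 1/L$. A sketch in Julia:

# gradient descent on f(x) = ½·L·x² with fixed stepsize c, starting from x0 = 1
function run_fixed(c; L = 10.0, iters = 50)
    x = 1.0
    for _ in 1:iters
        x = x - c * L * x        # x_{k+1} = x_k − c∇f(x_k) = (1 − cL)x_k
    end
    return abs(x)
end

for c in (0.001, 0.1, 0.21)      # too small / about right / above 2/L = 0.2
    println("c = $c  ->  |x_50| = ", run_fixed(c))
end

In higher dimensions $L$ is replaced by the largest eigenvalue of the Hessian, which depends on the data; this is why a stepsize that "worked fine" on five datasets may fail on the sixth.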

Exact Linesearch

Given $x_k$ and $s_k$, find $\alpha_k$ such that

$$ \alpha_k = \arg\min_{\alpha>0} \; f(x_k + \alpha s_k) $$

Solving this is not easy in general.

Exception: when $f$ is quadratic, $f(x) = \frac12 x^T H x + c^Tx$:

$$ \begin{aligned} &f(x+\alpha s) = \frac{1}{2} (x+\alpha s)^T H (x + \alpha s) + c^T (x + \alpha s) \\ & \frac{\partial f(x+\alpha s)}{\partial \alpha} = s^T H (x + \alpha s) + c^T s = 0\\ &\Rightarrow \alpha_k = \frac{-\nabla f(x_k)^T s_k} {s_k^T H s_k} \end{aligned} $$
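For the least-squares objective this means $H = X^T X$, so the exact stepsize along the steepest descent direction has a closed form. A sketch (modern Julia; the function name is ours) that never forms $H$ explicitly, since $s^T H s = \|Xs\|^2$:

using LinearAlgebra

# exact linesearch stepsize for L(w) = ½‖y − Xw‖² along s = −∇L(w)
function exact_step(X, y, w)
    g = -X' * (y - X * w)              # ∇L(w)
    s = -g                             # steepest descent direction
    return -dot(g, s) / norm(X * s)^2  # α_k = −∇f(w)ᵀs / (sᵀHs),  with sᵀHs = ‖Xs‖²
end

# one GD step with exact linesearch:  w = w + exact_step(X, y, w) * (X' * (y - X * w))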

Inexact Linesearch


  • Wolfe Linesearch

    • Find a stepsize satisfying “sufficient decrease” & “curvature” conditions
  • Backtracking (Armijo) Linesearch

    • Simpler version using only the “sufficient decrease” condition
    • In practice, it performs as well as more complicated linesearch methods

Backtracking Linesearch

Find $\alpha$ satisfying Armijo condition for some $c_1 \in (0,1)$:

$$ f(x_k + \alpha s_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^T s_k $$

[Figure: the Armijo (sufficient decrease) condition]

Backtracking Linesearch

Given $\alpha_0$, $\eta \in (0,1)$, and $c_1 \in (0,1)$:

  • $\alpha \leftarrow \alpha_0$
  • For i=1:maxiter,
    • If $\alpha$ satisfies the Armijo condition, return $\alpha$. $$ f(x_k + \alpha s_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^T s_k $$
    • Otherwise, $\alpha \leftarrow \eta \alpha$
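A direct Julia transcription of this loop (a sketch; the function name and the default constants $\alpha_0 = 1$, $\eta = 0.5$, $c_1 = 10^{-4}$ are our own choices). Note that $f(x_k)$ and $\nabla f(x_k)^T s_k$ are computed once, and only $f(x_k + \alpha s_k)$ is re-evaluated inside the loop:

using LinearAlgebra

# backtracking (Armijo) linesearch: shrink α by η until sufficient decrease holds
function backtracking(f, x, fx, g, s; alpha0 = 1.0, eta = 0.5, c1 = 1e-4, maxiter = 50)
    alpha = alpha0
    slope = dot(g, s)                           # ∇f(x_k)ᵀ s_k, computed once
    for _ in 1:maxiter
        if f(x + alpha * s) <= fx + c1 * alpha * slope
            return alpha                        # Armijo condition satisfied
        end
        alpha *= eta                            # otherwise shrink the stepsize
    end
    return alpha                                # fall back after maxiter trials
end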

Computation Cost?

  • Compute once: $f(x_k)$, $\nabla f(x_k)$, $\nabla f(x_k)^T s_k$.
  • Compute multiple times: $f(x_k + \alpha s_k)$

Linesearch is practical when function evaluations are not too expensive.

Two Important Properties of the Objective Function


  • Lipschitz continuity
  • Strong convexity


  • These properties simplify the analysis of many optimization algorithms and are therefore very popular
  • They are like two sides of the same coin

Lipschitz Continuity

Let $f: R^n \to R$ be differentiable. $\nabla f$ is Lipschitz continuous if

$$ \| \nabla f(y) - \nabla f(x) \| \le L \|y-x\|, \;\; \forall x, y \in R^n $$

for some constant $L>0$ (Lipschitz constant)

  • Lipschitz continuity can be defined for any function, e.g. $f$, $\nabla^2 f$, etc.

Strong Convexity

Let $f: R^n \to R$ be differentiable. $f$ is strongly convex if

$$ f((1-\lambda)y + \lambda x) \le (1-\lambda) f(y) + \lambda f(x) - \frac{\alpha}{2}\lambda (1-\lambda) \|y-x\|^2 $$

$\forall x, y \in R^n, \forall \lambda \in [0,1], \; \text{for some $\alpha>0$} $

  • Sometimes $f$ is called $\alpha$-strongly convex.
  • If $\alpha=0$, then the definition says $f$ is convex.

Equivalent Conditions

$\nabla f$ is Lipschitz continuous with $L>0$ if and only if $$ \begin{aligned} & f(y) \le f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2} \|y-x\|^2, \;\; \forall x,y \in R^n \\ & \langle \nabla f(y) - \nabla f(x), y-x \rangle \le L \|y-x\|^2 \;\; \forall x,y \in R^n \end{aligned} $$

$f$ is $\alpha$-strongly convex iff $$ \begin{aligned} & f(y) \ge f(x) + \langle \nabla f(x), y-x \rangle + \frac{\alpha}{2} \|y-x\|^2, \;\; \forall x,y \in R^n \\ & \langle \nabla f(y) - \nabla f(x), y-x \rangle \ge \alpha \|y-x\|^2, \;\; \forall x,y \in R^n \end{aligned} $$

What do they imply?

Lipschitz continuity of $\nabla f$:

[Figure: Lipschitz continuity of $\nabla f$]

Growth of $f$ is bounded above by a quadratic function everywhere

Strong convexity of $f$:

[Figure: strong convexity of $f$]

Growth of $f$ is bounded below by a quadratic function everywhere

What do they imply?

If $\nabla^2 f$ exists, then

  • $\nabla f$ Lipschitz continuous with $L>0$ $\;\Rightarrow\;$ $\nabla^2 f(x) \preceq L I_n$
  • $f$ is $\alpha$-strongly convex $\;\Rightarrow\;$ $\nabla^2 f(x) \succeq \alpha I_n$

Ex. Linear regression $$ f(w) = \frac12 \|y - Xw\|^2 $$

$$ \nabla f(w) = -X^T(y-Xw), \;\; \nabla^2 f(w) = X^TX $$
  • $L \ge $ the largest eigenvalue of $X^T X$, i.e. $\|X^T X\|$.
  • $\alpha \le $ the smallest eigenvalue of $X^T X$
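A quick numerical sanity check of these bounds on a small random problem (toy data and sizes are ours; modern Julia). Both tests should print `true`, matching the inner-product characterizations from the equivalent conditions above:

using LinearAlgebra

X = randn(30, 5); y = randn(30)          # small toy least-squares problem
grad(w) = -X' * (y - X * w)              # ∇f(w)

lam = eigvals(Symmetric(X' * X))         # eigenvalues of the (constant) Hessian
L, alpha = maximum(lam), minimum(lam)    # Lipschitz constant and strong convexity modulus

w1, w2 = randn(5), randn(5)
println(norm(grad(w1) - grad(w2)) <= L * norm(w1 - w2))                # Lipschitz bound
println(dot(grad(w1) - grad(w2), w1 - w2) >= alpha * norm(w1 - w2)^2)  # strong convexity

For large $m$ and $n$, computing the eigenvalues of $X^T X$ exactly is too expensive, which motivates the cheap estimate on the next slide.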

How to estimate $\|X\|$?

In [24]:
# crude estimate of ||X||: ||X*v|| for a single random unit vector v,
# which is always a lower bound on the spectral norm ||X||
function norm_est(X)
    v = rand(size(X, 2));      # was rand(n): avoid relying on the global n
    v = v / vecnorm(v);
    return(vecnorm(X*v));
end

m=1000;
n=1000;
X = rand(m,n);
    
@time L = norm(X,2);
@time Lest = norm_est(X)
@printf "L=%f, Lest=%f\n" L Lest
 15.532064 seconds (32 allocations: 763.524 MB, 0.01% gc time)
  0.065352 seconds (679 allocations: 832.342 KB)
L=5000.491537, Lest=4327.923540
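As the output shows, a single random unit vector only gives a (possibly loose) lower bound on $\|X\|$. A few power iterations on $X^T X$ tighten the estimate at the cost of two matrix-vector products per step; a sketch in modern Julia (1.x), where `vecnorm` has become `norm`:

using LinearAlgebra

# estimate ‖X‖ (largest singular value) by power iteration on XᵀX;
# only matrix-vector products with X and Xᵀ are needed
function norm_est_power(X; iters = 20)
    v = randn(size(X, 2))
    v /= norm(v)
    for _ in 1:iters
        v = X' * (X * v)      # one power iteration step for XᵀX
        v /= norm(v)
    end
    return norm(X * v)        # ≈ σ_max(X) = ‖X‖
end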

Convergence Rate of GD


  • $f$ convex: $O(1/k)$ with a fixed stepsize $c \le \frac{1}{L}$; also $O(1/k)$ with backtracking LS
  • $f$ strongly convex: $O(\gamma^k)$ with a fixed stepsize $c \le \frac{2}{\alpha + L}$; also $O(\gamma^k)$ with backtracking LS

$\gamma\in(0,1)$ depends on the condition number of $\nabla^2 f(x)$, i.e. $$ \frac{L}{\alpha} = \frac{\lambda_\max(\nabla^2 f(x))}{\lambda_\min(\nabla^2 f(x))} $$
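As a quick empirical check of the geometric rate (our own toy experiment, modern Julia): with the fixed stepsize $c = 1/L$ on a small least-squares problem, the gradient norm shrinks by a roughly constant factor per iteration:

using LinearAlgebra

# gradient descent with fixed stepsize c = 1/L on a toy least-squares problem
function run_gd(X, y; iters = 100)
    grad(w) = -X' * (y - X * w)
    c = 1 / eigmax(Symmetric(X' * X))    # c = 1/L with L = λ_max(XᵀX)
    w = zeros(size(X, 2))
    for k in 1:iters
        w = w - c * grad(w)
        k % 20 == 0 && println("k = $k   ||∇f(w_k)|| = ", norm(grad(w)))
    end
    return w
end

run_gd(randn(100, 10), randn(100));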