Large-Scale Optimization








$\newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}}$

L5. Stochastic Gradient Descent 2

TU Dortmund University, Dr. Sangkyun Lee


Stochastic (Sub)Gradient Descent (SGD)

Subgradient method + Randomness

Motivation

  • Very low per-iteration cost: each step touches a single data point instead of all $m$,
  • which may compensate for slow convergence.

Consider an objective with a very large sum over $m$ data points:

$$ \min_{x\in X \subset \R^n} \;\; \frac{1}{m}\sum_{i=1}^m \ell(x; D_i) \;\; \approx \;\; \E [ \ell(x; D) ] $$

where $\ell(\cdot; D_i): \R^n \to \R$ is a convex function.

SA/SGD Assumptions

$$ \min_{x \in X} \;\; f(x) = \E[ F(x;\xi) ] = \int_\Xi F(x;\xi)\, dP(\xi) $$

Assumptions:

  • $F(\cdot;\xi)$ is convex for each $\xi \in \Xi$. This implies $f(\cdot)$ is convex.
  • I.i.d. samples $\xi_1,\dots,\xi_m$ of the random variable $\xi$ are available.
  • $X$ is a nonempty compact convex set.
  • $f$ is proper.
  • We can make unbiased estimates $G$ of subgradients: $$ \E[G] = g \in \partial f(x) $$

Two constants:

  • $\E[ \|G(x,\xi)\|^2 ] \le M^2, \;\; \forall x$
  • $D = \max_{x\in X} \|x-x_1\|$
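
As a quick sanity check of the unbiasedness assumption, the following sketch (our own toy example; the names `subgrad`, `G`, `g` are ours) builds a finite sum of absolute losses, picks one valid subgradient per term, and verifies by Monte-Carlo that the single-sample estimate $G(x,\xi)$ averages to a full subgradient $g(x) \in \partial f(x)$.

In [ ]:
using LinearAlgebra, Statistics, Random

# Finite sum of absolute losses: f(x) = (1/m) * Σ_i |a_i'x - b_i|  (convex, nonsmooth)
Random.seed!(0)
m, n = 1000, 5
A, b = randn(m, n), randn(m)

# One valid subgradient of |a_i'x - b_i| w.r.t. x is sign(a_i'x - b_i) * a_i
subgrad(x, i) = sign(dot(A[i, :], x) - b[i]) * A[i, :]

g(x) = mean(subgrad(x, i) for i in 1:m)    # a full subgradient g(x) ∈ ∂f(x)
G(x) = subgrad(x, rand(1:m))               # stochastic estimate with ξ uniform on {1,…,m}

x = randn(n)
Ghat = mean(G(x) for _ in 1:10^5)          # Monte-Carlo estimate of E[G(x, ξ)]
println(norm(Ghat - g(x)))                 # small ⇒ E[G] = g ∈ ∂f(x), up to sampling error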

Classical SGD ($f$ is $c$-strongly convex)

Make the following change to GD: for $k=1,2,\dots$:

$$ x_{k+1} = \prod_{X} (x_k - \alpha_k G(x_k, \xi_k)) $$
  • $\xi_k \in \{1,\dots,m\}$ is a random index choosing a data point.
  • $\prod_X$ is the Euclidean projection onto the set $X$.

Choose stepsize $\alpha_k = \frac{\theta}{k}$ with $\theta > \frac{1}{2c}$.
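
A minimal sketch of this update (our own synthetic setup: a ridge-regularized least-squares finite sum, which is $c$-strongly convex, with $X$ a Euclidean ball so the projection has a closed form; the helpers `G`, `proj_ball`, `classical_sgd` are ours, not from the lecture):

In [ ]:
using LinearAlgebra, Random

Random.seed!(1)
m, n = 500, 10
A, b = randn(m, n), randn(m)
c = 1.0                                       # strong convexity constant (from the ridge term)

# f(x) = (1/(2m)) Σ_i (a_i'x - b_i)² + (c/2)‖x‖²  is c-strongly convex
G(x, ξ) = A[ξ, :] * (dot(A[ξ, :], x) - b[ξ]) + c * x   # unbiased stochastic gradient, ξ uniform on 1:m
proj_ball(x, R) = norm(x) <= R ? x : (R / norm(x)) * x # Euclidean projection onto ‖x‖ ≤ R

function classical_sgd(x1, K; θ = 1.0 / c, R = 10.0)
    x = proj_ball(x1, R)                      # x_1 ∈ X
    for k in 1:K
        αk = θ / k                            # α_k = θ/k with θ > 1/(2c)
        x  = proj_ball(x - αk * G(x, rand(1:m)), R)   # x_{k+1} = Π_X(x_k - α_k G(x_k, ξ_k))
    end
    return x
end

xstar = (A'A ./ m + c * I) \ (A'b ./ m)       # unconstrained minimizer (lies inside X for this R)
x     = classical_sgd(randn(n), 10^5)
println(norm(x - xstar))                      # should be small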

Random Variable Dependency

Given a starting point $x_1 \in \R^n$.

For $k=1,2,\dots$: $$ x_{k+1} = \prod_{X} (x_k - \alpha_k G(x_k, \xi_k)) $$

  • $x_2$ $\;\leftarrow\;$ $x_1$, $\xi_1$.
  • $x_3$ $\;\leftarrow\;$ $x_2$, $\xi_2$ $\;\leftarrow\;$ $x_1$, $\{\xi_1,\xi_2\}$
  • ...
  • $x_k$ $\;\leftarrow\;$ $x_1$, $\{\xi_1,\xi_2,\dots,\xi_{k-1}\} =: \xi_{[k-1]}$
  • $G(x_1,\xi_1) \;\leftarrow\; x_1, \xi_1$
  • $G(x_2,\xi_2) \;\leftarrow\; x_2, \xi_2 \;\leftarrow\; x_1, \{\xi_1,\xi_2\}$
  • ...
  • $G(x_k,\xi_k) \;\leftarrow\; x_1, \xi_{[k]}$

Strong Convexity

For all $x,y \in X$,

$$ f(y) \ge f(x) + g(x)^T (y-x) + \frac{c}{2} \|y-x\|^2, \;\; \forall g(x) \in \partial f(x) $$
$$ f(x) \ge f(y) + g(y)^T (x-y) + \frac{c}{2} \|y-x\|^2, \;\; \forall g(y) \in \partial f(y) $$

Adding them together, we get

$$ (g(y) - g(x))^T (y-x) \ge c \|y-x\|^2 $$

In particular,

$$ (g(x_k) - g(x^*))^T (x_k - x^*) \ge c \|x_k -x^*\|^2 $$
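
For example (a standard fact, added here because the SVM experiment at the end of this notebook has exactly this structure): adding an $\ell_2$ term to a convex function yields strong convexity,

$$ f(x) = h(x) + \frac{\lambda}{2}\|x\|^2, \;\; h \text{ convex} \;\;\Longrightarrow\;\; f \text{ is } \lambda\text{-strongly convex}, $$

so the $\ell_2$-regularized SVM objective used below is $\lambda$-strongly convex in $w$.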

Fundamental Optimality Condition

$x^*$ is a minimizer of the convex problem $\min_{x\in X} f(x)$ iff there exists $g(x^*) \in \partial f(x^*)$ such that

$$ (x - x^*)^T g(x^*) \ge 0, \;\; \forall x \in X $$


(Figure fund_opt: geometric illustration of the optimality condition.)

From the two results, for every $g(x_k) \in \partial f(x_k)$ and the $g(x^*) \in \partial f(x^*)$ from the optimality condition,

$$ \begin{cases} (g(x_k) - g(x^*))^T (x_k - x^*) \ge c \|x_k - x^*\|^2 \\ (x_k - x^*)^T g(x^*) \ge 0 \end{cases} $$

$$ \Rightarrow\;\; g(x_k)^T (x_k - x^*) \ge c \|x_k - x^*\|^2 \qquad (*) $$

Convergence Analysis of Classical SGD [Nemirovski et al., SIOPT, 2009]

Quantity of interest: $D_k := \frac12 \|x_k - x^*\|^2$ and $d_k := \E[D_k]$

$$ \begin{aligned} D_{k+1} &= \frac12 \big\| \prod_X(x_k - \alpha_k G(x_k,\xi_k)) - x^* \big\|^2 \\ &= \frac12 \big\| \prod_X(x_k - \alpha_k G(x_k,\xi_k)) - \prod_X(x^*) \big\|^2 \\ &\le \frac12 \big\| x_k - \alpha_k G(x_k,\xi_k) - x^* \big\|^2 \\ &= D_k + \frac12 \alpha_k^2 \|G(x_k,\xi_k)\|^2 - \alpha_k (x_k-x^*)^T G(x_k,\xi_k) \end{aligned} $$

(The second equality uses $x^* \in X$, so $x^* = \prod_X(x^*)$; the inequality uses the nonexpansiveness of the Euclidean projection.)
$$ \Rightarrow\;\; d_{k+1} \le d_k + \frac12 \alpha_k^2 M^2 - \alpha_k \E[ (x_k-x^*)^T G(x_k,\xi_k) ] $$

Recall that $x_k = x_k(\xi_{[k-1]})$ depends only on $\xi_{[k-1]}$ and is therefore independent of $\xi_k$. By the tower property of conditional expectation,

$$ \begin{aligned} \E[ (x_k-x^*)^T G(x_k,\xi_k) ] &= \E\big\{ \E[ (x_k-x^*)^T G(x_k,\xi_k) \,\big|\, \xi_{[k-1]} ] \big\} \\ &= \E\big\{ (x_k-x^*)^T \E [G(x_k,\xi_k) \,\big|\, \xi_{[k-1]} ] \big\} \\ &= \E\big\{ (x_k-x^*)^T g(x_k) \big\} \end{aligned} $$

Recall: $$ g(x_k)^T (x_k - x^*) \ge c \|x_k - x^*\|^2 \qquad (*) $$

$$ \Rightarrow\;\; \E[ (x_k-x^*)^T G(x_k,\xi_k) ] \ge c \E[ \|x_k-x^*\|^2 ] = 2c d_k $$

Finally,

$$ d_{k+1} \le d_k + \frac12 \alpha_k^2 M^2 - \alpha_k \E[ (x_k-x^*)^T G(x_k,\xi_k) ] $$

$$ \Rightarrow\;\; d_{k+1} \le d_k + \frac12 \alpha_k^2 M^2 - 2c \alpha_k d_k = (1-2c\alpha_k) d_k + \frac12 \alpha_k^2 M^2 $$

Setting $\alpha_k = \frac{\theta}{k}$ for $\theta > \frac{1}{2c}$, we can show using induction that

$$ \E[ \|x_k - x^*\|^2 ] = 2 d_k \le \frac{1}{k} \max\big\{ \frac{\theta^2 M^2}{2c\theta - 1}, \|x_1-x^*\|^2 \big\} . $$
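
This $O(1/k)$ bound can be checked numerically. The sketch below (our own 1-D toy problem, not from the lecture) runs classical SGD on $f(x) = \frac{c}{2}x^2$ over $X=[-1,1]$ with Gaussian gradient noise and prints $k \cdot \E\|x_k - x^*\|^2$, which should remain bounded when $\theta > \frac{1}{2c}$.

In [ ]:
using Statistics, Random

# Classical SGD on f(x) = (c/2) x² over X = [-1, 1], with noisy gradients G(x, ξ) = c*x + σ*ξ.
# The O(1/k) bound predicts that K * E|x_K - x*|² stays bounded (here x* = 0) when θ > 1/(2c).
Random.seed!(2)
c, σ, θ = 1.0, 0.5, 1.0                 # θ = 1/c > 1/(2c)
proj(x) = clamp(x, -1.0, 1.0)           # Euclidean projection onto X = [-1, 1]

function run_sgd(K)
    x = 1.0                             # x_1 = 1
    for k in 1:K
        x = proj(x - (θ / k) * (c * x + σ * randn()))
    end
    return x
end

for K in (10, 100, 1_000, 10_000)
    msq = mean(run_sgd(K)^2 for _ in 1:2000)    # Monte-Carlo estimate of E|x_K - x*|²
    println("K = $K:  K * E|x - x*|² ≈ ", round(K * msq, digits = 3))
end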

Classical SGD: Convergence

Choose stepsize $\alpha_k = \frac{\theta}{k}$ with $\theta > \frac{1}{2c}$.

Convergence:

  • Iterate convergence: $$\mathbb E [ \|x_k - x^*\|^2 ] \le \frac{1}{k} \max\left\{\frac{\theta^2M^2}{2c\theta-1}, \|x_1-x^*\|^2\right\}$$

  • If $f$ is differentiable, $\nabla f$ is $L$-Lipschitz continuous, and $x^* \in \text{int}\, X$: $$ \mathbb E [ f(x_k) - f(x^*) ] \le \frac{L}{2k} \max\left\{\frac{\theta^2M^2}{2c\theta-1}, \|x_1-x^*\|^2\right\} $$

Classical SGD: Issues

The convergence guarantee is sensitive to

  • the requirement $c>0$, i.e. that $f$ is strongly convex,
  • using a (nearly) correct value of $c$ in the stepsize $\alpha_k = \frac{\theta}{k}$, $\theta > \frac{1}{2c}$.

Ex. $f(x) = x^2/10$, $G(x,\xi) = \nabla f(x)$ (no noise), $X = [-1,1]$

  • What is the strong convexity constant $c$ of $f(x)$?

Ex. $f(x) = x^2/10$, $G(x,\xi) = \nabla f(x)$ (no noise), $X = [-1,1]$


  • Suppose we (wrongly) take $c=1$, so $\theta = 1 > 1/(2c)$, and start at $x_1=1$ $\;\Rightarrow\;$ $\alpha_k = 1/k$

$$ x_{k+1} = x_k - \frac{1}{5k} x_k = \left(1-\frac{1}{5k}\right) x_k $$
$$ \begin{aligned} x_{k} &= \prod_{j=1}^{k-1} \left(1-\frac{1}{5j}\right) = \exp \left[ -\sum_{j=1}^{k-1} \ln \left( 1 + \frac{1}{5j-1} \right) \right] \\ &> \exp\left[ -\sum_{j=1}^{k-1} \frac{1}{5j-1} \right] > \exp\left[ -\left( \frac14 + \int_1^{k-1} \frac{1}{5t-1}\, dt \right) \right] > \exp\left( -\tfrac14 - \tfrac15\ln\tfrac54 - \tfrac15\ln k\right)\\ &> 0.74\, k^{-1/5} \end{aligned} $$

With $k=10^9$, we still have $x_k > 0.01$: after a billion iterations the iterate has barely moved toward $x^*=0$.

Ex. $f(x) = x^2/10$, $G(x,\xi) = \nabla f(x)$ (no noise), $X = [-1,1]$


With the correct value $c=0.2$:

Choosing $\theta = \frac{1}{c} = 5$, $\alpha_k = \frac{\theta}{k} = \frac{5}{k}$:

$$ x_2 = x_1 - \alpha_1 \nabla f(x_1) = 1 - 5 \cdot \tfrac15 = 0 $$

That is, it'll converge to the minimizer $x^*=0$ in one step!
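
The sketch below (ours) simply iterates the noiseless recursion $x_{k+1} = \big(1 - \frac{\theta}{5k}\big) x_k$ for both choices, confirming the painfully slow $k^{-1/5}$ decay for $\theta = 1$ and the one-step convergence for $\theta = 5$.

In [ ]:
# Noiseless example f(x) = x²/10 on X = [-1, 1]: ∇f(x) = x/5, x_1 = 1.
# θ = 1 (c wrongly taken as 1) decays only like k^{-1/5}; θ = 5 (correct c = 0.2) hits x* = 0 in one step.
function iterate_example(θ, K)
    x = 1.0
    for k in 1:K
        x = clamp(x - (θ / k) * (x / 5), -1.0, 1.0)   # x_{k+1} = (1 - θ/(5k)) x_k, projected onto [-1, 1]
    end
    return x
end

for K in (10^3, 10^6, 10^9)                           # 10^9 steps take a few seconds
    println("θ = 1, after K = $K steps: x ≈ ", iterate_example(1.0, K))
end
println("θ = 5, after one step: x = ", iterate_example(5.0, 1))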

In [26]:
using PyPlot

# Illustrates ln(1+x) <= x for x > -1, the inequality used in the example above.
x = -1:.01:5

plot(x, log(1+x), x, x)

ax = gca()
ax[:spines]["top"][:set_visible](false)      # Hide the top edge of the axis
ax[:spines]["right"][:set_visible](false)    # Hide the right edge of the axis
ax[:spines]["left"][:set_position]("zero")   # Move the left axis to the center
ax[:spines]["bottom"][:set_position]("zero") # Move the bottom axis to the center
ax[:xaxis][:set_ticks_position]("bottom")    # Show x-ticks only at the bottom
ax[:yaxis][:set_ticks_position]("left")      # Show y-ticks only on the left

annotate("y=x", xy=[4.9; 4.5])
annotate("y=ln(1+x)", xy=[4.9; log(1+4.9)+.2]);
In [27]:
using PyPlot

x=0:.01:5

# The summand 1/(5t-1), which is bounded by the integral in the example above
plot(x, 1./(5x-1))
ylim([0,.5]);

Robust SGD

Differences to the classical SGD:

  • $f$ doesn't have to be strongly convex
  • Make a weighted average of iterates (from iterations $s$ to $k$)
$$ \tilde x_s^k = \sum_{t=s}^k \nu_t x_t, \quad \text{where} \quad \nu_t = \frac{\alpha_t}{\sum_{j=s}^k \alpha_j} \;\; \Big(\text{so that } \sum_{t=s}^k \nu_t = 1\Big) $$
  • $ D = \max_{x\in X} \|x-x_1\| $

Robust SGD: Convergence

Constant stepsize $\alpha_k = \frac{D}{M\sqrt{K}}$, $$ \E[ f(\tilde x_s^K) - f(x^*) ] \le \frac{DM}{\sqrt{K}} \left[ \frac{2K}{K-s+1} + \frac{1}{2} \right] $$

Diminishing stepsize $\alpha_k = \frac{\theta D}{M\sqrt{k}}$ for any $\theta>0$, $$ \E[ f(\tilde x_s^k) - f(x^*) ] \le \frac{DM}{\sqrt{k}} \left[ \frac{2}{\theta} \left( \frac{k}{k-s+1} \right) + \frac{\theta}{2} \sqrt{\frac{k}{s}} \right] $$

$s = \lceil rk \rceil$, $r \in (0,1)$: $$ \E[ f(\tilde x_s^k) - f(x^*) ] \le \frac{DM}{\sqrt{k}} \left[ \frac{2}{\theta} \left( \frac{1}{1-r} \right) + \frac{\theta}{2} \sqrt{r} \right] $$
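
A minimal sketch of this scheme (our own synthetic setup: an absolute-loss finite sum minimized over a Euclidean ball; the choices of `D`, `M`, and the averaging window are rough estimates, not the constants from the analysis):

In [ ]:
using LinearAlgebra, Statistics, Random

# Robust SGD: constant stepsize α = D/(M√K) and averaging of the last iterates.
Random.seed!(3)
m, n = 500, 10
A, b = randn(m, n), randn(m)
R = 5.0                                            # X = {x : ‖x‖ ≤ R}, x_1 = 0

f(x)    = mean(abs.(A * x .- b))                   # (1/m) Σ_i |a_i'x - b_i|: convex, not strongly convex
G(x, ξ) = sign(dot(A[ξ, :], x) - b[ξ]) * A[ξ, :]   # stochastic subgradient from one data point
proj(x) = norm(x) <= R ? x : (R / norm(x)) * x     # Euclidean projection onto X

function robust_sgd(K; r = 0.5)
    D = R                                          # D = max_{x ∈ X} ‖x - x_1‖ with x_1 = 0
    M = sqrt(maximum(sum(abs2, A, dims = 2)))      # ‖G(x, ξ)‖ ≤ max_i ‖a_i‖
    α = D / (M * sqrt(K))                          # constant stepsize
    s = ceil(Int, r * K)                           # average the iterates t = s, …, K
    x, xavg = zeros(n), zeros(n)
    for k in 1:K
        x = proj(x - α * G(x, rand(1:m)))
        k >= s && (xavg .+= x)
    end
    return xavg ./ (K - s + 1)                     # equal weights ν_t since α_t is constant
end

xavg = robust_sgd(10^5)
println("f(x_avg) = ", f(xavg))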

In [18]:
using MNIST

Xtr,Ytr = traindata()
Xtr = Xtr' ./255;
In [22]:
# Subsample the training set (sampling with replacement); set s < m below to actually reduce it
using StatsBase

m, n = size(Xtr)
println("m=$m, n=$n")

s = 60000
idx = sample(1:m, s)

X = Xtr[idx,:]
Y = Ytr[idx];

m, n = size(X)
println("Reduced: m=$m, n=$n")

# Make it a binary classification problem:
# even digits (+1) vs. odd digits (-1)

oddind = map(isodd, round(Int64,Y))
Y[oddind] = -1
Y[!oddind] = +1

np = length(find(Y.==1))
nn = s - np;
println("No. classes: +1=$np, -1=$nn")

λ=0.01;
m=60000, n=784
Reduced: m=60000, n=784
No. classes: +1=29610, -1=30390
In [24]:
epochs = 5;

#
# Linear SVM min_{w,b} 1/m \sum_i^m \max(0, 1 - y(w^Tx +b)) + λ/2 \|w\|^2
#
# G: if 1 - y(w^Tx+b) > 0,    [-y x; -y] + [λ w; 0]
#    o.w. [λ w; 0]

w = zeros(n);
b = 0;

@time for e=1:epochs
    
    idx = randperm(m)
    
    for k=1:m
        
        x = vec(X[idx[k],:])
        y = Y[idx[k]]
        
        α = 1/sqrt(m*e+k)              # diminishing stepsize ≈ 1/sqrt(overall iteration count)
        
        err = 1 - y*(dot(w,x) + b)     # hinge margin violation; positive ⇒ the hinge term contributes -y*[x; 1]
        if(err > 0) 
            w = w - α*(- y*x + λ*w)
            b = b - α*(-y)
        else
            w = w - α*λ*w
        end
        
    end
    
end
  9.912297 seconds (8.34 M allocations: 7.148 GB, 10.19% gc time)
In [25]:
pred = [dot(vec(X[i,:]),w) + b for i=1:m]
correct = Y.*pred
@printf "Prediction accuracy on the training set = %.2f %%" sum(correct .> 0) / m

Xtt, Ytt = testdata()
Xtt = Xtt' ./255;
mt = size(Xtt,1)
oddind = map(isodd, round(Int64,Ytt))
Ytt[oddind] = -1
Ytt[!oddind] = +1
println("")

pred = [dot(vec(Xtt[i,:]),w) + b for i=1:mt]
correct = Ytt.*pred
@printf "Prediction accuracy on the test set (size=%d) = %.2f %%" m (sum(correct .> 0)/mt)
Prediction accuracy on the training set = 0.89 %
Prediction accuracy on the test set (size=60000) = 0.89 %