Subgradient method + Randomness
Motivation
Consider an objective with a very large sum over $m$ data points:
$$ \min_{x\in X \subset \R^n} \;\; \frac{1}{m}\sum_{i=1}^m \ell(x; D_i) \;\; \approx \;\; \E [ \ell(x; D) ] $$where $\ell(\cdot; D_i): \R^n \to \R$ is a convex function.
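For instance, sampling an index $i$ uniformly at random from $\{1,\dots,m\}$ and taking a subgradient of the single sampled term yields an unbiased stochastic subgradient of the full objective:
$$ \E_{i}\big[ g_i(x) \big] = \frac{1}{m}\sum_{i=1}^m g_i(x) \;\in\; \partial\Big( \frac{1}{m}\sum_{i=1}^m \ell(x; D_i) \Big), \qquad g_i(x) \in \partial_x \ell(x; D_i), $$which is one natural choice for the stochastic oracle $G(x,\xi)$ used below, with $\xi$ playing the role of the random index.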
Assumptions: $f(x) := \E[\ell(x; D)]$ is strongly convex on $X$ with modulus $c > 0$, and at any query point $x$ we can draw a stochastic subgradient $G(x,\xi)$ that is unbiased, $\E_\xi[G(x,\xi)] \in \partial f(x)$.
Two constants: the strong-convexity modulus $c$, and a bound $M$ on the stochastic subgradients, $\E[\|G(x,\xi)\|^2] \le M^2$ for all $x \in X$.
Make the following change to GD. Given a starting point $x_1 \in \R^n$, for $k=1,2,\dots$:
$$ x_{k+1} = \prod_{X} (x_k - \alpha_k G(x_k, \xi_k)) $$where $G(x_k,\xi_k)$ is the stochastic subgradient drawn at iteration $k$. Choose the stepsize $\alpha_k = \frac{\theta}{k}$ with $\theta > \frac{1}{2c}$.
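As a concrete illustration, here is a minimal Julia sketch of the projected stochastic (sub)gradient update with $\alpha_k = \theta/k$; the toy objective, the box constraint, and all constants below are illustrative assumptions, not part of the notes.

using LinearAlgebra, Random, Statistics

# Illustrative toy instance: f(x) = (1/m) Σ_i ½‖x - a_i‖² over X = [-1,1]^n,
# which is strongly convex with modulus c = 1.
function projected_sgd(; n = 5, m = 10_000, K = 100_000)
    Random.seed!(1)
    A = randn(n, m)                          # columns a_i play the role of the data D_i
    projX(v) = clamp.(v, -1.0, 1.0)          # projection onto the box X = [-1,1]^n
    G(x) = x - A[:, rand(1:m)]               # unbiased stochastic gradient of f at x
    c = 1.0
    θ = 1.0 / c                              # any θ > 1/(2c) is admissible
    x = zeros(n)                             # starting point x_1
    for k in 1:K
        α = θ / k                            # diminishing stepsize α_k = θ/k
        x = projX(x - α * G(x))              # x_{k+1} = Π_X(x_k - α_k G(x_k, ξ_k))
    end
    xstar = projX(vec(mean(A, dims = 2)))    # true minimizer: projected mean of the a_i
    return norm(x - xstar)
end

println(projected_sgd())                     # distance ‖x_K - x*‖ after K iterations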
For all $x,y \in X$,
$$ f(y) \ge f(x) + g(x)^T (y-x) + \frac{c}{2} \|y-x\|^2, \;\; \forall g(x) \in \partial f(x), $$and, swapping the roles of $x$ and $y$, $f(x) \ge f(y) + g(y)^T (x-y) + \frac{c}{2} \|x-y\|^2$ for all $g(y) \in \partial f(y)$. Adding the two inequalities, we get
$$ (g(y) - g(x))^T (y-x) \ge c \|y-x\|^2 $$In particular,
$$ (g(x_k) - g(x^*))^T (x_k - x^*) \ge c \|x_k -x^*\|^2 $$Moreover, $x^*$ is a minimizer of the convex problem $\min_{x\in X} f(x)$ iff
$$ (x - x^*)^T g(x^*) \ge 0, \;\; \forall x \in X, $$for some $g(x^*) \in \partial f(x^*)$; in particular this holds with $x = x_k$. Combining the two results, for every $g(x_k) \in \partial f(x_k)$ and this $g(x^*) \in \partial f(x^*)$,
$$ \begin{cases} (g(x_k) - g(x^*))^T (x_k - x^*) \ge c \|x_k - x^*\|^2 & \\ (x_k - x^*)^T g(x^*) \ge 0 \end{cases} $$$$ \Rightarrow g(x_k)^T (x_k - x^*) \ge c \|x_k - x^*\|^2 \qquad (*) $$Quantity of interest: $D_k := \frac12 \|x_k - x^*\|^2$ and $d_k := \E[D_k]$
$$ \begin{aligned} D_{k+1} &= \frac12 \big\| \prod_X(x_k - \alpha_k G(x_k,\xi_k)) - x^* \big\|^2 \\ &= \frac12 \big\| \prod_X(x_k - \alpha_k G(x_k,\xi_k)) - \prod_X(x^*) \big\|^2 \\ &\le \frac12 \big\| x_k - \alpha_k G(x_k,\xi_k) - x^* \big\|^2 \\ &= D_k + \frac12 \alpha_k^2 \|G(x_k,\xi_k)\|^2 - \alpha_k (x_k-x^*)^T G(x_k,\xi_k) \end{aligned} $$where the inequality uses the nonexpansiveness of the projection $\prod_X$. Taking expectations, recall that $x_k = x_k(\xi_{[k-1]})$ depends only on $\xi_{[k-1]}$ and is therefore independent of $\xi_k$, so
$$ \begin{aligned} \E[ (x_k-x^*)^T G(x_k,\xi_k) ] &= \E\big\{ \E[ (x_k-x^*)^T G(x_k,\xi_k) \,\big|\, \xi_{[k-1]} ] \big\} \\ &= \E\big\{ (x_k-x^*)^T \E [G(x_k,\xi_k) \,\big|\, \xi_{[k-1]} ] \big\} \\ &= \E\big[ (x_k-x^*)^T g(x_k) \big], \end{aligned} $$since, by unbiasedness, $g(x_k) := \E [G(x_k,\xi_k) \,|\, \xi_{[k-1]} ] \in \partial f(x_k)$. Recall $$ g(x_k)^T (x_k - x^*) \ge c \|x_k - x^*\|^2 , \qquad (*) $$so $\E[ (x_k-x^*)^T G(x_k,\xi_k) ] \ge c\, \E[ \|x_k - x^*\|^2 ] = 2c\, d_k$.
Finally,
$$ d_{k+1} \le d_k + \frac12 \alpha_k^2 M^2 - \alpha_k \E[ (x_k-x^*)^T G(x_k,\xi_k) ] $$$$ \Rightarrow d_{k+1} \le d_k + \frac12 \alpha_k^2 M^2 - 2c \alpha_k d_k = (1-2c\alpha_k) d_k + \frac12 \alpha_k^2 M^2 $$Setting $\alpha_k = \frac{\theta}{k}$ for $\theta > \frac{1}{2c}$, we can show using induction that
$$ \E[ \|x_k - x^*\|^2 ] = 2 d_k \le \frac{1}{k} \max\big\{ \frac{\theta^2 M^2}{2c\theta - 1}, \|x_1-x^*\|^2 \big\} . $$
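To see why the induction goes through, write $R := \max\big\{ \frac{\theta^2 M^2}{2c\theta - 1}, \|x_1-x^*\|^2 \big\}$, so that $2d_1 = \|x_1 - x^*\|^2 \le R$, and suppose $2 d_k \le \frac{R}{k}$. Then, in the main case $1 - 2c\alpha_k \ge 0$,
$$ d_{k+1} \le \Big(1 - \frac{2c\theta}{k}\Big)\frac{R}{2k} + \frac{\theta^2 M^2}{2k^2} = \frac{R}{2k} - \frac{2c\theta R - \theta^2 M^2}{2k^2} \le \frac{R}{2k} - \frac{R}{2k^2} = \frac{R(k-1)}{2k^2} \le \frac{R}{2(k+1)}, $$where the second inequality uses $(2c\theta - 1) R \ge \theta^2 M^2$ and the last uses $(k-1)(k+1) \le k^2$.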
Convergence:
Iterate convergence: $$\mathbb E [ \|x_k - x^*\|^2 ] \le \frac{1}{k} \max\left\{\frac{\theta^2M^2}{2c\theta-1}, \|x_1-x^*\|^2\right\}$$
If $f$ is differentiable, $\nabla f$ is $L$-Lipschitz continuous, and $x^* \in \text{int}\, X$: $$ \mathbb E [ f(x_k) - f(x^*) ] \le \frac{L}{2k} \max\left\{\frac{\theta^2M^2}{2c\theta-1}, \|x_1-x^*\|^2\right\} $$
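The second bound follows from the quadratic upper bound implied by the $L$-Lipschitz gradient together with $\nabla f(x^*) = 0$ at an interior minimizer:
$$ f(x_k) - f(x^*) \le \nabla f(x^*)^T (x_k - x^*) + \frac{L}{2} \|x_k - x^*\|^2 = \frac{L}{2} \|x_k - x^*\|^2 , $$and taking expectations and applying the iterate bound gives the stated rate.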
The convergence is sensitive to the choice of $\theta$, i.e. to how well the strong-convexity modulus $c$ is estimated:
Ex. $f(x) = x^2/10$, $G(x,\xi) = \nabla f(x)$ (no noise), $X = [-1,1]$. The true strong-convexity modulus is $c = f''(x) = 0.2$. If $c$ is overestimated, the stepsize $\alpha_k = \theta/k$ is far too small: for instance, starting from $x_1 = 1$ with $\alpha_k = 1/k$ (i.e. $\theta = 1$), the iteration is $x_{k+1} = (1 - 0.2/k)\, x_k \approx k^{-1/5}$, so even with $k=10^9$, we still have $x_k > 0.015$.
With the correct value $c=0.2$:
Choosing $\theta = \frac{1}{c} = 5$, $\alpha_k = \frac{\theta}{k} = \frac{5}{k}$:
$$ x_2 = x_1 - \alpha_1 \nabla f(x_1) = x_1 - 5 \cdot \frac{x_1}{5} = 0 $$That is, it converges to the minimizer $x^*=0$ in a single step!
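A small Julia sketch of this sensitivity; the starting point, the horizon $K$, and the "bad" choice $\theta = 1$ below are illustrative assumptions.

# f(x) = x^2/10 on X = [-1,1]: ∇f(x) = x/5, no noise.
proj(x) = clamp(x, -1.0, 1.0)
gradf(x) = x / 5

function run_diminishing(θ; x = 1.0, K = 10^6)
    for k in 1:K
        x = proj(x - (θ / k) * gradf(x))   # x_{k+1} = Π_X(x_k - (θ/k) ∇f(x_k))
    end
    return x
end

println(run_diminishing(1.0))   # θ too small (c overestimated): x_K ≈ K^(-1/5) ≈ 0.06
println(run_diminishing(5.0))   # θ = 1/c with the true c = 0.2: x_2 = 0 already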
using PyPlot

# Compare y = x and y = ln(1+x) on [-1, 5].
x = -1:0.01:5
plot(x, log.(1 .+ x), x, x)
ax = gca()
ax.spines["top"].set_visible(false)      # Hide the top edge of the axis
ax.spines["right"].set_visible(false)    # Hide the right edge of the axis
ax.spines["left"].set_position("zero")   # Move the left axis to the center
ax.spines["bottom"].set_position("zero") # Move the bottom axis to the center
ax.xaxis.set_ticks_position("bottom")    # Show x-ticks only on the bottom
ax.yaxis.set_ticks_position("left")      # Show y-ticks only on the left
annotate("y=x", xy=[4.9; 4.5])
annotate("y=ln(1+x)", xy=[4.9; log(1 + 4.9) + 0.2]);
using PyPlot

# Plot x ↦ 1/(5x - 1) on [0, 5], clipping the vertical axis to [0, 0.5].
x = 0:0.01:5
plot(x, 1 ./ (5 .* x .- 1))
ylim([0, 0.5]);
Differences from the classical SGD:
Constant stepsize $\alpha_k = \frac{D}{M\sqrt{K}}$, $$ \E[ f(\tilde x_s^K) - f(x^*) ] \le \frac{DM}{\sqrt{K}} \left[ \frac{2K}{K-s+1} + \frac{1}{2} \right] $$
Diminishing stepsize $\alpha_k = \frac{\theta D}{M\sqrt{k}}$ for any $\theta>0$, $$ \E[ f(\tilde x_s^k) - f(x^*) ] \le \frac{DM}{\sqrt{k}} \left[ \frac{2}{\theta} \left( \frac{k}{k-s+1} \right) + \frac{\theta}{2} \sqrt{\frac{k}{s}} \right] $$
$s = \lceil rk \rceil$, $r \in (0,1)$: $$ \E[ f(\tilde x_s^k) - f(x^*) ] \le \frac{DM}{\sqrt{k}} \left[ \frac{2}{\theta} \left( \frac{1}{1-r} \right) + \frac{\theta}{2} \sqrt{r} \right] $$
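A minimal Julia sketch of the averaged variant with the constant stepsize above, assuming $\tilde x_s^K$ is the plain average of the iterates $x_s,\dots,x_K$ (with a constant stepsize the stepsize-weighted and plain averages coincide); the toy objective and the values of $D$, $M$, $r$ below are illustrative assumptions.

using LinearAlgebra, Random, Statistics

# Same illustrative toy instance as before: f(x) = (1/m) Σ_i ½‖x - a_i‖² over X = [-1,1]^n.
function averaged_sgd(; n = 5, m = 10_000, K = 10_000, r = 0.5)
    Random.seed!(2)
    A = randn(n, m)
    projX(v) = clamp.(v, -1.0, 1.0)
    G(x) = x - A[:, rand(1:m)]               # unbiased stochastic gradient
    D = 2 * sqrt(n)                          # diameter of the box X = [-1,1]^n
    M = 3 * sqrt(n)                          # rough bound so that E‖G(x,ξ)‖² ≤ M² on X (illustrative)
    α = D / (M * sqrt(K))                    # constant stepsize D/(M√K)
    s = ceil(Int, r * K)                     # averaging window starts at s = ⌈rK⌉
    x = zeros(n)
    xavg = zeros(n)                          # running average of x_s, …, x_K
    for k in 1:K
        x = projX(x - α * G(x))
        if k >= s
            xavg += (x - xavg) / (k - s + 1) # incremental mean update
        end
    end
    xstar = projX(vec(mean(A, dims = 2)))
    return norm(xavg - xstar)
end

println(averaged_sgd())                      # error of the averaged iterate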
# Example: linear SVM on MNIST (even vs odd digits) trained with SGD
using MNIST
Xtr, Ytr = traindata()      # training features (784×60000) and labels
Xtr = Xtr' ./ 255;          # transpose to (examples × features) and scale pixels to [0,1]
# Reduce the size of X
using StatsBase
m, n = size(Xtr)
println("m=$m, n=$n")
s = 60000
idx = sample(1:m, s)
X = Xtr[idx,:]
Y = Ytr[idx];
m, n = size(X)
println("Reduced: m=$m, n=$n")
# Make it a binary classification problem: even digits vs odd digits
oddind = map(isodd, round.(Int64, Y))
Y[oddind] .= -1        # odd digits → class -1
Y[.!oddind] .= +1      # even digits → class +1
np = count(Y .== 1)
nn = s - np;
println("No. classes: +1=$np, -1=$nn")
λ = 0.01;
m=60000, n=784
Reduced: m=60000, n=784
No. classes: +1=29610, -1=30390
epochs = 5;
#
# Linear SVM min_{w,b} 1/m \sum_i^m \max(0, 1 - y(w^Tx +b)) + λ/2 \|w\|^2
#
# G: if 1 - y(w^Tx+b) > 0, [-y x; -y] + [λ w; 0]
# o.w. [λ w; 0]
using Random, LinearAlgebra   # randperm, dot

w = zeros(n);
b = 0.0;
@time for e = 1:epochs
    idx = randperm(m)                    # reshuffle the data each epoch
    for k = 1:m
        global w, b
        x = vec(X[idx[k], :])
        y = Y[idx[k]]
        α = 1 / sqrt(m*e + k)            # diminishing stepsize
        err = 1 - y * (dot(w, x) + b)    # hinge-loss margin term
        if err > 0
            w = w - α * (-y * x + λ * w) # subgradient of hinge loss + regularizer
            b = b - α * (-y)
        else
            w = w - α * λ * w            # only the regularizer contributes
        end
    end
end
9.912297 seconds (8.34 M allocations: 7.148 GB, 10.19% gc time)
using Printf

pred = [dot(vec(X[i,:]), w) + b for i = 1:m]
correct = Y .* pred                      # positive ⇔ correctly classified
acc = 100 * sum(correct .> 0) / m
@printf "Prediction accuracy on the training set = %.0f %%" acc
Xtt, Ytt = testdata()
Xtt = Xtt' ./255;
mt = size(Xtt,1)
oddind = map(isodd, round.(Int64, Ytt))
Ytt[oddind] .= -1
Ytt[.!oddind] .= +1
println("")
pred = [dot(vec(Xtt[i,:]), w) + b for i = 1:mt]
correct = Ytt .* pred
acc = 100 * sum(correct .> 0) / mt
@printf "Prediction accuracy on the test set (size=%d) = %.0f %%" mt acc
Prediction accuracy on the training set = 89 %
Prediction accuracy on the test set (size=10000) = 89 %