Some paper suggestions have been added for the mini-project (see the lecture website)
Raoul Lefmann: Parallelization in Julia (exercise session date: May 18th)
$\newcommand{\R}{\mathbb{R}}$
$$ \min_{x \in \R^n} \;\; f(x) $$
where $f$ is a convex function.
Method Choices:
For $f: \R^n \to \R$, a subgradient of $f$ at $x$ is a vector $g$ satisfying
$$ f(y) \ge f(x) + \langle g, y-x \rangle \;\; \forall y \in \R^n. $$
Gradient: $s = -\nabla f(x)$ is a descent direction, i.e., $\langle \nabla f(x), s \rangle < 0$.
Subgradient: $s = -g$ is not necessarily a descent direction.
Ex. $f(x,y) = 2|x| + |y|$
using PyPlot

n = 200
x = collect(range(-2, 2, length=n))
y = collect(range(-2, 2, length=n))
xgrid = repeat(x', n, 1)   # xgrid[i,j] = x[j]
ygrid = repeat(y, 1, n)    # ygrid[i,j] = y[i]
z = zeros(n, n)
for i in 1:n
    for j in 1:n
        z[i, j] = 2abs(x[j]) + abs(y[i])   # f(x,y) = 2|x| + |y| on the grid (rows index y, columns index x)
    end
end
cp = contour(xgrid, ygrid, z, colors="blue", linewidths=2.0)
clabel(cp, inline=1, fontsize=8)
# xlabel("x")
# ylabel("y")
#tight_layout()
ax = gca()
ax.spines["top"].set_visible(false)        # hide the top edge of the axis
ax.spines["right"].set_visible(false)      # hide the right edge of the axis
ax.spines["left"].set_position("center")   # move the left axis to the center
ax.spines["bottom"].set_position("center") # move the bottom axis to the center
ax.xaxis.set_ticks_position("bottom")      # show x-ticks only at the bottom
ax.yaxis.set_ticks_position("left")        # show y-ticks only at the left
arrow(1, 0, 0.5, 0.25,      # draw the subgradient direction (2,1), scaled down for display
      head_width=0.1,
      width=0.00015,
      head_length=0.05,
      overhang=0.5,
      head_starts_at_zero=false,
      facecolor="black")
annotate("(2,1)", xy=[1.6; 0.3])
# annotate("Look, data!",
# xy=[.5;.5],
# xytext=[2;1],
# xycoords="data",
# arrowprops=Dict("facecolor"=>"black"))
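To see numerically that $-g$ need not be a descent direction, here is a quick check for the example above (a minimal sketch; at $x = (0,1)$ the vector $g = (2,1)$ is a valid subgradient of $f(x,y) = 2|x| + |y|$, yet moving along $-g$ increases $f$):

```julia
# f(x, y) = 2|x| + |y|; check whether -g is a descent direction at (0, 1)
f(v) = 2abs(v[1]) + abs(v[2])

x0 = [0.0, 1.0]
g  = [2.0, 1.0]     # valid subgradient at (0,1): any (t, 1) with |t| <= 2

for s in (0.1, 0.01, 0.001)
    # the difference is positive, so -g is not a descent direction here
    println("s = $s:  f(x0 - s*g) - f(x0) = ", f(x0 .- s .* g) - f(x0))
end
```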
Some Proofs
For each $y \in \R^n$, the set
$$ h_y := \{g : \langle g, y-x \rangle \le f(y) - f(x)\} $$
is a closed half-space. Therefore,
$$ \partial f(x) = \bigcap_{y \in \R^n} h_y $$
is closed and convex.
If $f$ is Lipschitz continuous with constant $L>0$, then $f(y) - f(x) \le L\|y-x\|$ for all $x, y$, so
$$ \partial f(x) \subseteq \{ g : \langle g, d \rangle \le L\|d\| \;\; \forall d \in \R^n\}. $$
Therefore $\|g\|$ must be bounded by $L$.
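In more detail: for any $d \in \R^n$, taking $y = x + d$ in the subgradient inequality gives
$$ \langle g, d \rangle \le f(x+d) - f(x) \le L\|d\|, $$
and choosing $d = g$ yields $\|g\|^2 \le L\|g\|$, i.e. $\|g\| \le L$.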
Ex. $f(x) = \|x\|_1 = \sum_{i=1}^n |x_i|$
using PyPlot

n = 200
x = collect(range(-1, 1, length=n))
y = collect(range(-1, 1, length=n))
xgrid = repeat(x', n, 1)
ygrid = repeat(y, 1, n)
z = zeros(n, n)
for i in 1:n
    for j in 1:n
        z[i, j] = abs(x[j]) + abs(y[i])   # f(x,y) = |x| + |y| on the grid
    end
end
fig = figure("pyplot_surfaceplot", figsize=(10,10))
using3D()   # load mplot3d so that projection="3d" is available
ax = fig.add_subplot(1, 1, 1, projection="3d")
ax.plot_surface(xgrid, ygrid, z, rstride=2, cstride=2, edgecolors="k",
                cmap=ColorMap("gray"), alpha=0.8, linewidth=0.25)
xlabel("x")
ylabel("y")
A subgradient $g$ of $f(x) = \|x\|_1$ is given componentwise by
$$ g_i \in \begin{cases} \{1\} & \text{if } x_i > 0\\ \{-1\} & \text{if } x_i < 0\\ [-1,1] & \text{otherwise.} \end{cases} $$
Changes to gradient descent: at each iteration we step along the negative of an arbitrary subgradient, $x_{k+1} = x_k - \alpha_k g_k$ with $g_k \in \partial f(x_k)$, and we keep track of the best iterate seen so far, since the method is not a descent method (see the sketch below).
Assumptions: the subgradient norms are bounded, $\|g_k\| \le M$ for all $k$, and the initial distance to a minimizer is bounded, $\|x_1 - x^*\| \le D$.
We can show that the best iterate up to the $k$th iteration satisfies
$$ f(x_{k, \text{best}}) - f(x^*) \le \frac{D^2 + M^2 \sum_{i=1}^k \alpha_i^2}{2\sum_{i=1}^k \alpha_i}. $$
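This follows from the standard expansion of the squared distance to $x^*$ along the update $x_{i+1} = x_i - \alpha_i g_i$, $g_i \in \partial f(x_i)$:
$$ \|x_{i+1} - x^*\|^2 = \|x_i - x^*\|^2 - 2\alpha_i \langle g_i, x_i - x^* \rangle + \alpha_i^2 \|g_i\|^2 \le \|x_i - x^*\|^2 - 2\alpha_i \big(f(x_i) - f(x^*)\big) + \alpha_i^2 M^2, $$
where the inequality uses $f(x^*) \ge f(x_i) + \langle g_i, x^* - x_i \rangle$ and $\|g_i\| \le M$. Summing over $i = 1, \dots, k$, dropping $\|x_{k+1} - x^*\|^2 \ge 0$, and bounding $f(x_i) - f(x^*) \ge f(x_{k,\text{best}}) - f(x^*)$ gives the stated bound.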
Constant stepsize $\alpha_i = c$: the right-hand side tends to $\frac{M^2 c}{2}$ as $k \to \infty$, so with a constant stepsize the method is only guaranteed to reach an $O(c)$ neighborhood of the optimal value.
Recall: gradient descent (for smooth $f$) had a faster $O(1/k)$ convergence rate; the subgradient method only achieves $O(1/\sqrt{k})$, even with the best choice of stepsizes.
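As an illustration, here is a minimal sketch of the subgradient method with a constant stepsize, applied to $f(x) = \|x\|_1$ (the function and variable names are ours, not part of the lecture code):

```julia
# Subgradient method with constant stepsize for f(x) = ||x||_1,
# keeping track of the best iterate (the method is not a descent method).
f(x) = sum(abs, x)
subgrad(x) = sign.(x)          # one valid subgradient: g_i = sign(x_i), with 0 when x_i = 0

function subgradient_method(x0; alpha = 0.01, iters = 1000)
    x     = copy(x0)
    xbest = copy(x0)
    fbest = f(x0)
    for k in 1:iters
        x = x - alpha * subgrad(x)     # step along -g; f may increase in some iterations
        if f(x) < fbest
            fbest, xbest = f(x), copy(x)
        end
    end
    return xbest, fbest
end

xbest, fbest = subgradient_method(randn(10); alpha = 0.01, iters = 2000)
println("best objective value found: ", fbest)
```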
Subgradient method + Randomness
Motivation
Consider an objective function with a very large number of terms in the sum:
$$ \min_{x\in X \subset \R^n} \;\; \sum_{i=1}^m \ell(x; D_i) $$
where $\ell(\cdot\,; D_i): \R^n \to \R$ is a convex function.
The (full) gradient of $f$ is (assuming $f$ is differentiable)
$$ \nabla f(x) = \sum_{i=1}^m \nabla \ell(x; D_i), $$
which requires $m$ gradient evaluations at every iteration.
Q. Can we avoid this?
Consider a stochastic subgradient $G$ of $f$,
$\newcommand{\E}{\mathbb{E}}$
$$ f(y) \ge f(x) + \langle \E[G], y-x \rangle \;\; \forall y \in \R^n. $$
Properties:
$G$ is unbiased: $$ \E[G] = g \in \partial f(x) $$
$G$ can be noisy: e.g., $$ G = g + v, \;\; g \in \partial f(x), \;\; \E[v] = 0. $$
The usual formulation has
$$ \min_{x \in X}\;\; \tilde f(x) = \frac{1}{m} \sum_{i=1}^m F(x;\xi_i) $$
This is called Sample Average Approximation (SAA) or Empirical Risk Minimization (ERM).
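As a quick numerical illustration of the unbiasedness property above, take $F(x;\xi_i) = |\langle a_i, x\rangle - b_i|$ with illustrative random data (a sketch, not from the lecture): averaging many single-sample subgradients recovers a full subgradient of $\tilde f$.

```julia
using Random, LinearAlgebra, Statistics

Random.seed!(1)
m, n = 100, 5
A, b = randn(m, n), randn(m)
x = randn(n)

# a full subgradient of (1/m) * sum_i |<a_i, x> - b_i|
full_g = A' * sign.(A * x .- b) ./ m

# single-sample stochastic subgradient: pick i uniformly; E[G] = full_g
G(i) = sign(dot(A[i, :], x) - b[i]) * A[i, :]
avg_g = mean(G(rand(1:m)) for _ in 1:100_000)

println(norm(avg_g - full_g))   # close to zero: G is an unbiased estimator
```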
In data science, what we really want to do is
$$ \min_{x \in X} \;\; f(x) = \E[ F(x;\xi) ] = \int_\Xi F(x;\xi) \, dP(\xi) $$
This is called Stochastic Approximation (SA) or Risk Minimization.
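A minimal stochastic subgradient sketch for the finite-sum (SAA/ERM) problem, again with $F(x;\xi_i) = |\langle a_i, x\rangle - b_i|$ and illustrative data (names and parameters are ours): each iteration touches a single data point instead of all $m$.

```julia
using Random, LinearAlgebra

# Stochastic subgradient method for (1/m) * sum_i |<a_i, x> - b_i|
function stochastic_subgradient(A, b; alpha0 = 0.1, iters = 50_000)
    m, n = size(A)
    x = zeros(n)
    for k in 1:iters
        i = rand(1:m)                                 # sample one data point uniformly
        g = sign(dot(A[i, :], x) - b[i]) * A[i, :]    # subgradient of the sampled term
        x -= (alpha0 / sqrt(k)) * g                   # diminishing stepsize
    end
    return x
end

Random.seed!(0)
A = randn(500, 20); xtrue = randn(20)
b = A * xtrue + 0.1 * randn(500)                      # illustrative data
xhat = stochastic_subgradient(A, b)
println("relative error: ", norm(xhat - xtrue) / norm(xtrue))
```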
Assumptions:
Two constants:
$f$ is $c$-strongly convex, and the stepsize is $\alpha_k = \frac{\theta}{k}$ with $\theta > \frac{1}{2c}$,
$f$ is differentiable and $\nabla f$ is $L$-Lipschitz continuous, and $x^* \in \text{int}\, X$,