We consider minimizing a convex function $f$ over $R^n$.
Method Choices:
Many optimization algorithms consist of the following iteration, starting from an initial point $x_1$:

For $k=1,2,\dots$: choose a search direction $s_k$ and a stepsize $\alpha_k$, then update $x_{k+1} = x_k + \alpha_k s_k$.

In gradient descent, we use $s_k = -\nabla f(x_k)$ (the steepest descent direction); a minimal sketch follows.
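As a sketch in Julia (the language of the later examples), assuming the gradient is available as a function `grad`; the name `gradient_descent` and the default values of `α`, `maxiter`, and `ϵ` are illustrative, not prescribed:

```julia
using LinearAlgebra

# Gradient descent with a fixed stepsize α and the stopping
# criterion ‖∇f(x_k)‖ < ϵ (discussed below).
function gradient_descent(grad, x0; α=0.1, maxiter=1000, ϵ=1e-6)
    x = copy(x0)
    for k in 1:maxiter
        s = -grad(x)           # steepest descent direction s_k = -∇f(x_k)
        norm(s) < ϵ && break   # stop once the gradient is small
        x += α * s             # x_{k+1} = x_k + α_k s_k
    end
    return x
end
```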
Find a linear function $f_w(x) = x^T w$, parametrized by $w$, that minimizes the squared prediction error.
Dataset: $\{ (x_i, y_i) \}_{i=1}^m$, $x_i \in R^n$, $y_i \in R$
Data (design) matrix $X\in R^{m\times n}$: $$ y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}, \quad X = \begin{bmatrix} \longleftarrow & x_1^T & \longrightarrow \\ \longleftarrow & x_2^T & \longrightarrow \\ \vdots & \vdots & \vdots \\ \longleftarrow & x_m^T & \longrightarrow \end{bmatrix}, \quad w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} $$
Objective: $$ L(w) = \frac12 \sum_{i=1}^m (y_i - x_i^Tw)^2 = \frac12 \|y - Xw\|^2 $$
Gradient: $$ \nabla L(w) = -X^T (y-Xw) \in R^n $$
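As a sketch, the objective and gradient transcribe directly into Julia; the names `loss` and `grad` are illustrative:

```julia
using LinearAlgebra

loss(w, X, y) = 0.5 * norm(y - X * w)^2   # L(w) = ½‖y − Xw‖²
grad(w, X, y) = -X' * (y - X * w)         # ∇L(w) = −Xᵀ(y − Xw)
```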
When $f$ is convex, then $x^*$ is a minimizer of $$ \min_{x\in R^n} \;\; f(x) $$ if and only if (iff) $$ \nabla f(x^*) = 0. $$ A natural stopping criterion is therefore $$ \| \nabla f(x_k) \| < \epsilon $$ for some small $\epsilon >0$.
Scaling: how does $L(w) = \frac12 \|y-Xw\|^2$ scale with large $m$?
Stopping criterion: $\|\nabla f(x_k)\| < \epsilon$. Which norm will be better?
%%tikz --scale 1 --size 1600,1600 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=0:2.2, thick, variable=\y, red] plot({\y}, {1 + (\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (1.8,.5+.5*1.8^2) -- (1.8,0) node[below] {\tiny $x_k + s_k$};
\fill (1,1) circle [radius=1pt];
\fill (1.8,.5+.5*1.8^2) circle [radius=1pt];
\fill (1.8,1.8) circle [radius=1pt];
\draw[dashed,->] (1.8,.5+.5*1.8^2) -- (3.5, 1+.5*1.8^2) node[right] {$f(x_k + s_k)$};
\draw[dashed,->] (1.8,1.8) -- (7, .7+.5*1.8^2) node[above] {$\approx f(x_k) + \nabla f(x_k)^T s_k$};
We want $f(x_k+s_k) \le f(x_k)$. From the linear approximation, this holds for small enough steps whenever $\nabla f(x_k)^T s_k < 0$, i.e. when $s_k$ is a descent direction.
%%tikz --scale 1 --size 1600,1600 -f png
\draw[->] (-1,0) -- (2.5,0) node[right] {\scriptsize $R^n$};
\draw[domain=-1.5:2.2, smooth, variable=\x, blue] plot({\x}, {.5+.5*\x*\x});
\draw[domain=-.9:2, thick, variable=\y, red] plot({\y}, {1 + (\y-1) + (\y-1)*(\y-1)});
\draw[dotted] (1,1) -- (1,0) node[below] {\tiny $x_k$};
\draw[dotted] (.5,.75) -- (.5,0) node[below] {\tiny $x_{k+1}$};
\fill (1,1) circle [radius=1pt];
\fill (.5,.75) circle [radius=1pt];
\draw[dashed,->] (1.8,1+.8+.8^2) -- (3, 1.5+.5*1.8^2) node[right] {$f(x) \approx f(x_k) + \nabla f(x_k)^T(x-x_k) + \frac{1}{2\alpha_k}\|x-x_k\|^2$};
Take the minimizer of the RHS as the next iterate $x_{k+1}$:
$$ \nabla f(x_k) + (x_{k+1}-x_k)/\alpha_k = 0 \;\; \Rightarrow \;\; x_{k+1} = x_k - \alpha_k \nabla f(x_k) $$

Convergence of GD depends on how we choose the stepsizes:
The best choice depends on the properties of the objective function:
Take $\alpha_k = c$ for a constant $c>0$: $x_{k+1} = x_k + c\, s_k = x_k - c\, \nabla f(x_k)$.
Surprisingly, many ML papers do something like:

"We designed a gradient descent algorithm with a fixed stepsize 0.1, which worked fine for the five datasets we tried."

What can be problematic about this?
Given $x_k$ and $s_k$, find $\alpha_k$ such that
$$ \alpha_k = \arg\min_{\alpha>0} \; f(x_k + \alpha s_k) $$

Solving this is not easy in general.
Exception: when $f$ is quadratic: $\;\;f(x) = \frac12 x^T H x + c^T x$
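In that case $\phi(\alpha) = f(x_k + \alpha s_k)$ is a one-dimensional quadratic in $\alpha$, so (assuming $s_k^T H s_k > 0$) the exact minimizer has a closed form:

$$ \phi(\alpha) = f(x_k) + \alpha \nabla f(x_k)^T s_k + \frac{\alpha^2}{2}\, s_k^T H s_k \;\;\Rightarrow\;\; \alpha_k = -\frac{\nabla f(x_k)^T s_k}{s_k^T H s_k}, $$

which for $s_k = -\nabla f(x_k)$ gives $\alpha_k = \|\nabla f(x_k)\|^2 / \big(\nabla f(x_k)^T H \nabla f(x_k)\big)$.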
Wolfe Linesearch

(The Wolfe conditions pair the Armijo sufficient-decrease condition below with a curvature condition, $\nabla f(x_k + \alpha s_k)^T s_k \ge c_2 \nabla f(x_k)^T s_k$ for some $c_2 \in (c_1, 1)$.)
Backtracking (Armijo) Linesearch
Find $\alpha$ satisfying the Armijo condition for some $c_1 \in (0,1)$:
$$ f(x_k + \alpha s_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^T s_k $$

Given $\alpha_0$, $\eta \in (0,1)$, and $c_1 \in (0,1)$: start from $\alpha = \alpha_0$ and repeatedly shrink $\alpha \leftarrow \eta\alpha$ until the Armijo condition holds.
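A sketch of this loop in Julia, with the default values of `α0`, `η`, and `c1` chosen for illustration:

```julia
using LinearAlgebra

# Backtracking (Armijo) linesearch: shrink α by η until the
# sufficient-decrease condition holds.
function backtracking(f, grad, x, s; α0=1.0, η=0.5, c1=1e-4)
    α  = α0
    fx = f(x)
    δ  = dot(grad(x), s)   # directional derivative ∇f(x)ᵀs (< 0 for descent)
    while f(x + α * s) > fx + c1 * α * δ
        α *= η
    end
    return α
end
```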
Computational cost?
Each trial stepsize costs one evaluation of $f$, so linesearch (LS) is practical when function evaluations are not too expensive.
Let $f: R^n \to R$ be differentiable. $\nabla f$ is Lipschitz continuous if
$$ \| \nabla f(y) - \nabla f(x) \| \le L \|y-x\|, \;\; \forall x, y \in R^n $$

for some constant $L>0$ (the Lipschitz constant).
Let $f: R^n \to R$ be differentiable. $f$ is strongly convex if
$$ f((1-\lambda)y + \lambda x) \le (1-\lambda) f(y) + \lambda f(x) - \frac{\alpha}{2}\lambda (1-\lambda) \|y-x\|^2 $$

for all $x, y \in R^n$ and all $\lambda \in [0,1]$, for some $\alpha>0$ (we then say $f$ is $\alpha$-strongly convex).
$\nabla f$ is Lipschitz continuous with $L>0$ if and only if $$ \begin{aligned} & f(y) \le f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2} \|y-x\|^2, \;\; \forall x,y \in R^n \\ & \langle \nabla f(y) - \nabla f(x), y-x \rangle \le L \|y-x\|^2 \;\; \forall x,y \in R^n \end{aligned} $$
$f$ is $\alpha$-strongly convex iff $$ \begin{aligned} & f(y) \ge f(x) + \langle \nabla f(x), y-x \rangle + \frac{\alpha}{2} \|y-x\|^2, \;\; \forall x,y \in R^n \\ & \langle \nabla f(y) - \nabla f(x), y-x \rangle \ge \alpha \|y-x\|^2, \;\; \forall x,y \in R^n \end{aligned} $$
Lipschitz continuity of $\nabla f$:
Growth of $f$ is bounded above by a quadratic function everywhere
Strong convexity of $f$:
Growth of $f$ is bounded below by a quadratic function everywhere
If $\nabla^2 f$ exists, then these two conditions say that $\alpha I \preceq \nabla^2 f(x) \preceq L I$ for all $x$, i.e. the eigenvalues of the Hessian lie in $[\alpha, L]$.
Ex. Linear regression $$ f(w) = \frac12 \|y - Xw\|^2 $$ Here $\nabla^2 f(w) = X^T X$, so $L = \lambda_{\max}(X^T X) = \|X\|_2^2$ and $\alpha = \lambda_{\min}(X^T X)$; estimating $\|X\|_2$ therefore gives us $L$:
```julia
using LinearAlgebra

# Crude randomized estimate: for any unit vector v, ‖Xv‖ ≤ ‖X‖₂,
# so this returns a lower bound on the spectral norm.
function norm_est(X)
    v = rand(size(X, 2))
    v = v / norm(v)
    return norm(X * v)
end
```
```julia
using LinearAlgebra, Printf

m = 1000
n = 1000
X = rand(m, n)
@time L = opnorm(X)       # exact spectral norm (largest singular value)
@time Lest = norm_est(X)  # randomized lower bound
@printf "L=%f, Lest=%f\n" L Lest
```
```
15.532064 seconds (32 allocations: 763.524 MB, 0.01% gc time)
 0.065352 seconds (679 allocations: 832.342 KB)
L=5000.491537, Lest=4327.923540
```
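The one-shot random estimate is only a lower bound. As an alternative sketch (not part of the original experiment), a few power-iteration steps on $X^TX$ usually tighten the estimate considerably; the iteration count here is an arbitrary choice:

```julia
using LinearAlgebra

# Estimate ‖X‖₂ = √λ_max(XᵀX) by power iteration.
function norm_est_power(X; iters=20)
    v = rand(size(X, 2))
    for _ in 1:iters
        v = X' * (X * v)   # one power-iteration step on XᵀX
        v = v / norm(v)
    end
    return norm(X * v)     # ≈ largest singular value of X
end
```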
|  | Fixed stepsize $c$ | Backtracking LS |
|---|---|---|
| $f$ Convex | $c \le \frac{1}{L}$: $O(1/k)$ | $O(1/k)$ |
| $f$ Strongly Convex | $c \le \frac{2}{\alpha + L}$: $O(\gamma^k)$ | $O(\gamma^k)$ |
$\gamma\in(0,1)$ depends on the condition number of $\nabla^2 f(x)$, i.e. $$ \frac{L}{\alpha} = \frac{\lambda_\max(\nabla^2 f(x))}{\lambda_\min(\nabla^2 f(x))} $$
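To see the fixed-stepsize rule in action, here is a small illustrative run of GD with $c = 1/L$ on least squares; the problem size, seed, and iteration budget are assumptions made for this demo:

```julia
using LinearAlgebra, Printf, Random

Random.seed!(0)
m, n = 200, 50
X, y = randn(m, n), randn(m)   # random instance; XᵀX is a.s. full rank, so f is strongly convex
grad(w) = -X' * (y - X * w)    # ∇f(w) for f(w) = ½‖y − Xw‖²

# Gradient descent with fixed stepsize c.
function run_gd(grad, w0, c, iters)
    w = copy(w0)
    for _ in 1:iters
        w -= c * grad(w)       # x_{k+1} = x_k − c ∇f(x_k)
    end
    return w
end

c = 1 / opnorm(X)^2            # c = 1/L with L = λ_max(XᵀX) = ‖X‖₂²
w = run_gd(grad, zeros(n), c, 2000)
@printf "‖∇f(w)‖ = %.2e\n" norm(grad(w))
```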