Gradient descent I: strongly convex functions

April 2022

A function $f$ is convex when $f(x) - f(y) \geqslant \langle \nabla f(y), x-y\rangle$; with this pretty loose definition, it can very well happen that two different points $x, y$ have the same value $f(x) = f(y)$. Finding a minimum of $f$ with gradient descent can then lead to very different behaviours. For smooth functions, we have more quantitative notions of convexity, and they lead to convergence results for the gradient descent algorithm.

Quantitative convexity

The largest eigenvalues of the Hessian of $f$ determine the Lipschitz smoothness of $f$, while the smallest ones determine the strong convexity of $f$. More precisely, if $\lambda_{\min}(x), \lambda_{\max}(x)$ are the smallest and largest eigenvalues of $\nabla^2 f(x)$ (the Hessian of $f$ at the point $x$, a square matrix), then by elementary arguments we have, for every vector $v$,

$$\lambda_{\min}(x)\,|v|^2 \leqslant \langle v, \nabla^2 f(x)v\rangle \leqslant \lambda_{\max}(x)\,|v|^2. \tag{1}$$
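As a quick numerical illustration (my own sketch, not part of the note), this bound on the quadratic form can be checked with NumPy on a random symmetric matrix standing in for the Hessian at a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric matrix playing the role of the Hessian at a fixed point x.
A = rng.standard_normal((5, 5))
H = (A + A.T) / 2

lam_min, lam_max = np.linalg.eigvalsh(H)[[0, -1]]   # eigenvalues in ascending order

for _ in range(1000):
    v = rng.standard_normal(5)
    quad = v @ H @ v                                 # <v, Hessian * v>
    assert lam_min * (v @ v) - 1e-10 <= quad <= lam_max * (v @ v) + 1e-10
```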

But then, if we fix two vectors $x, y$, using Taylor's formula yields the following result.

Let $f$ be a $\mathscr{C}^2$, convex function. If there are two numbers $0 < \eta \leqslant M$ such that for every $x$ we have $\eta \leqslant \lambda_{\min}(x)$ and $\lambda_{\max}(x) \leqslant M$, then for every $x, y$,
$$\langle \nabla f(x), y-x\rangle + \frac{\eta}{2}|x-y|^2 \leqslant f(y) - f(x) \leqslant \langle \nabla f(x), y-x\rangle + \frac{M}{2}|x-y|^2. \tag{2}$$

The lower bound is a stronger version of convexity; it is often called $\eta$-strong convexity. The upper bound is nothing more than the fact that if the Hessian norm is bounded by $M$, then the gradient of $f$ is $M$-Lipschitz; we say that $f$ is $M$-smooth. In general, (2) is taken as the definition of strong convexity and smoothness, without requiring any regularity assumption.
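As a concrete example (again my own sketch, not from the note), take a quadratic $f(x) = \frac{1}{2}\langle x, Hx\rangle$ with $H$ symmetric positive definite: its Hessian is $H$ at every point, so $\eta$ and $M$ can be taken as the extreme eigenvalues of $H$, and (2) can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic test function f(x) = 0.5 * <x, Hx>; its Hessian is H at every point.
A = rng.standard_normal((4, 4))
H = A @ A.T + 0.5 * np.eye(4)            # symmetric positive definite

def f(x):
    return 0.5 * x @ H @ x

def grad_f(x):
    return H @ x

eta, M = np.linalg.eigvalsh(H)[[0, -1]]  # strong convexity / smoothness constants

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    gap = f(y) - f(x)
    lin = grad_f(x) @ (y - x)
    dist2 = np.sum((x - y) ** 2)
    # Two-sided bound (2): eta/2 from below, M/2 from above.
    assert lin + 0.5 * eta * dist2 - 1e-9 <= gap <= lin + 0.5 * M * dist2 + 1e-9
```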

Convergence of the gradient descent dynamic

Strongly convex functions have a unique minimizer, say $x$. The gradient descent algorithm for finding this minimum consists in following the steepest descent direction, given by the opposite of the gradient of $f$, with a step size of $\varepsilon$: starting from $x_0$, this descent is written

$$x_{n+1} = x_n - \varepsilon \nabla f(x_n).$$
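In code, the iteration is only a few lines; here is a minimal sketch (the names `grad_f`, `x0`, `eps`, `n_steps` are mine, not from the note):

```python
import numpy as np

def gradient_descent(grad_f, x0, eps, n_steps):
    """Iterate x_{n+1} = x_n - eps * grad_f(x_n) and return all iterates."""
    x = np.asarray(x0, dtype=float)
    iterates = [x]
    for _ in range(n_steps):
        x = x - eps * grad_f(x)
        iterates.append(x)
    return iterates
```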

Theorem.

If $f$ is $\eta$-strongly convex and $M$-smooth, and if the step size $\varepsilon$ is smaller than $1/M$, then for every $n$ we have

$$|x_n - x|^2 \leqslant (1 - \varepsilon \eta)^n |x_0 - x|^2.$$
Proof. Developing the Euclidean norm and using the LHS of (2) with $x = x_n$ and $y = x$, we get
$$\begin{aligned}
|x_{n+1} - x|^2 &= |x_n - \varepsilon \nabla f(x_n) - x|^2 \\
&= |x_n - x|^2 - 2\varepsilon \langle \nabla f(x_n), x_n - x\rangle + \varepsilon^2 |\nabla f(x_n)|^2 \\
&\leqslant |x_n - x|^2 + 2\varepsilon\left(f(x) - f(x_n)\right) - \eta \varepsilon|x - x_n|^2 + \varepsilon^2 |\nabla f(x_n)|^2 \\
&\leqslant (1 - \eta \varepsilon)\,|x_n - x|^2 + z
\end{aligned}$$
where $z = \varepsilon^2|\nabla f(x_n)|^2 - 2\varepsilon(f(x_n) - f(x))$. Now we only have to check that $z$ is nonpositive; if so, we will obtain $|x_{n+1} - x|^2 \leqslant (1 - \eta \varepsilon)|x_n - x|^2$ and a recursion will finish the proof.

Proof of $z \leqslant 0$. By the RHS of (2) and the definition of $x_{n+1}$,

$$f(x_{n+1}) - f(x_n) \leqslant -\varepsilon |\nabla f(x_n)|^2 + \frac{\varepsilon^2 M}{2}|\nabla f(x_n)|^2 = \varepsilon |\nabla f(x_n)|^2 \left( \varepsilon M/2 - 1\right),$$

but then, if $\varepsilon M \leqslant 1$, the RHS is smaller than $-(\varepsilon/2) |\nabla f(x_n)|^2$. Overall, we get that

$$\varepsilon |\nabla f(x_n)|^2 \leqslant 2\left(f(x_n) - f(x_{n+1})\right) \leqslant 2\left(f(x_n) - f(x)\right),$$

or equivalently $z \leqslant 0$ (the last inequality holds because $x$ is the minimum of $f$, so $f(x_{n+1}) \geqslant f(x)$).
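As a sanity check of the theorem (my own experiment, not part of the note), one can run the descent on a quadratic with $\varepsilon = 1/M$ and compare $|x_n - x|^2$ with the guaranteed bound $(1 - \varepsilon\eta)^n |x_0 - x|^2$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Quadratic objective f(x) = 0.5 * <x, Hx>, whose unique minimizer is x = 0.
A = rng.standard_normal((6, 6))
H = A @ A.T + np.eye(6)

eta, M = np.linalg.eigvalsh(H)[[0, -1]]
eps = 1.0 / M                            # step size allowed by the theorem

x0 = rng.standard_normal(6)
x, sq_dist = x0, [np.sum(x0 ** 2)]       # squared distances |x_n - 0|^2
for _ in range(50):
    x = x - eps * (H @ x)                # gradient step, since grad f(x) = Hx
    sq_dist.append(np.sum(x ** 2))

for n, d in enumerate(sq_dist):
    # Theorem: |x_n - x|^2 <= (1 - eps * eta)^n |x_0 - x|^2.
    assert d <= (1 - eps * eta) ** n * sq_dist[0] + 1e-12
```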


There are several takeaway messages in this elementary but powerful result.

Stochastic gradient descent

There are many situations where one would like to perform gradient descent, but accessing the true gradient $\nabla f(x_n)$ at every step can be very resource-consuming. This is the case in many machine learning problems, where $f$ is a sum of individual losses over a dataset, and estimating the whole gradient would require computing as many gradients as there are elements in the dataset. In a follow-up note I give the same kind of result, but for SGD, when the loss still satisfies inequalities like (2).
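To fix ideas, here is a minimal sketch of the stochastic update in question, assuming $f = \frac{1}{n} \sum_i f_i$ and access to the individual gradients (the function names and signature are my own, not from the follow-up note):

```python
import numpy as np

def sgd(grad_i, n_samples, x0, eps, n_steps, seed=0):
    """Stochastic gradient descent for f(x) = (1/n) * sum_i f_i(x).

    grad_i(x, i) returns the gradient of the i-th individual loss; at each
    step only one such gradient is computed instead of the full sum.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        i = rng.integers(n_samples)      # sample one data point uniformly
        x = x - eps * grad_i(x, i)       # unbiased estimate of grad f(x)
    return x
```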