A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex when $f(tx + (1-t)y) \leqslant t f(x) + (1-t) f(y)$ for every $x, y$ and every $t \in [0,1]$; with this pretty loose definition, it can very well happen that two different points $x \neq y$ have the same value $f(x) = f(y)$. Finding a minimum of $f$ with gradient descent can then lead to very different behaviours. For smooth functions, we have more quantitative notions of convexity, and they lead to convergence results for the gradient descent algorithm.
The largest eigenvalues of the Hessian of $f$ determine the Lipschitz smoothness of $\nabla f$, while the smallest ones determine the strong convexity of $f$. More precisely, if $\mu$ and $L$ bound from below and above the smallest and largest eigenvalues of $\nabla^2 f(x)$ (the Hessian of $f$ at point $x$, a $d \times d$ square matrix), uniformly in $x$, then by elementary arguments we have, for every $v \in \mathbb{R}^d$,

$$\mu |v|^2 \leqslant \langle \nabla^2 f(x)\, v, v\rangle \leqslant L |v|^2. \tag{1}$$
But then, if we fix two vectors $x, y$, using Taylor's formula yields the following result:

$$f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}|y-x|^2 \;\leqslant\; f(y) \;\leqslant\; f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}|y-x|^2. \tag{2}$$
The lower bound is a stronger version of convexity; it is often called $\mu$-strong convexity. The upper bound is nothing more than the fact that if the Hessian norm is bounded by $L$, then the gradient of $f$ is $L$-Lipschitz; we say that $f$ is $L$-smooth. In general, (2) is taken as the definition of strong convexity and smoothness, without asking for any regularity assumption.
Strongly convex functions have a unique minimum, say $x^\star$. The gradient descent algorithm for finding this minimum consists in following the steepest descent direction, given by the gradient of $f$, with a step size $\varepsilon > 0$: starting from some $x_0$, this descent is written

$$x_{t+1} = x_t - \varepsilon \nabla f(x_t).$$
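As a minimal sketch of this iteration (in Python with NumPy; the function `gradient_descent` and its toy usage are my own illustration, not part of the note):

```python
import numpy as np

def gradient_descent(grad, x0, step, n_steps):
    """Run n_steps of the iteration x_{t+1} = x_t - step * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x

# Toy run on f(x) = |x|^2 / 2, whose gradient is x and whose minimum is 0.
x_final = gradient_descent(grad=lambda x: x, x0=np.ones(3), step=0.1, n_steps=100)
print(x_final)  # all coordinates equal 0.9**100, close to zero
```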
Theorem.
If $f$ is $\mu$-strongly convex and $L$-smooth, and if the step size $\varepsilon$ is smaller than $1/L$, then for every $t$ we have

$$f(x_t) - f(x^\star) \leqslant (1 - \varepsilon\mu)^t \,(f(x_0) - f(x^\star)).$$
Proof of the theorem. By the RHS of (2) taken at $x = x_t$ and $y = x_{t+1} = x_t - \varepsilon \nabla f(x_t)$,

$$f(x_{t+1}) \leqslant f(x_t) - \varepsilon\left(1 - \frac{L\varepsilon}{2}\right)|\nabla f(x_t)|^2;$$

but then, if $\varepsilon \leqslant 1/L$, the RHS is smaller than $f(x_t) - \frac{\varepsilon}{2}|\nabla f(x_t)|^2$. Moreover, minimizing both sides of the LHS of (2) over $y$ gives $|\nabla f(x_t)|^2 \geqslant 2\mu\,(f(x_t) - f(x^\star))$. Overall, we get that

$$f(x_{t+1}) - f(x^\star) \leqslant (1 - \varepsilon\mu)\,(f(x_t) - f(x^\star)),$$

or equivalently, by induction, $f(x_t) - f(x^\star) \leqslant (1 - \varepsilon\mu)^t\,(f(x_0) - f(x^\star))$.
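As a sanity check (a small experiment of mine, not from the original note), one can run the descent on a quadratic whose Hessian directly exhibits $\mu$ and $L$, and verify that the suboptimality gap contracts at least as fast as $(1 - \varepsilon\mu)^t$:

```python
import numpy as np

# f(x) = <Hx, x>/2 with a diagonal Hessian H: mu = 1, L = 10, minimum x* = 0.
H = np.diag([1.0, 10.0])
mu, L = 1.0, 10.0
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

eps = 1.0 / L                 # step size allowed by the theorem
x = np.array([1.0, 1.0])
gap0 = f(x)                   # f(x_0) - f(x*), since f(x*) = 0
for t in range(1, 51):
    x = x - eps * grad(x)
    # Bound of the theorem: f(x_t) - f(x*) <= (1 - eps*mu)^t * (f(x_0) - f(x*)).
    assert f(x) <= (1 - eps * mu) ** t * gap0 + 1e-12
print(f(x))                   # far below (1 - 1/10)**50 * gap0
```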
There are several takeaway messages in this elementary but powerful result.
Having a step size smaller than $1/L$, the inverse of the Lipschitz constant, is mandatory to get something meaningful. Otherwise, if the step size is too large, we could take steps that are consistently too big and overshoot the minimum at every step, as in the small example below.
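A one-dimensional illustration of this overshooting (my own toy example): for $f(x) = \frac{L}{2}x^2$ the iteration is $x_{t+1} = (1 - \varepsilon L)x_t$, which converges when $\varepsilon < 2/L$ and blows up beyond that.

```python
L = 10.0                    # Lipschitz constant of the gradient f'(x) = L * x
for step in (0.05, 0.25):   # 0.05 < 1/L, while 0.25 > 2/L
    x = 1.0
    for _ in range(50):
        x -= step * L * x   # gradient step on f(x) = L * x**2 / 2
    print(f"step={step}: x_50 = {x:.3e}")
# step=0.05: x_50 ≈ 8.9e-16 (converged); step=0.25: x_50 ≈ 6.4e+08 (diverged)
```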
We don't need $f$ to be twice differentiable, only that $f$ is $\mu$-strongly convex and $L$-smooth, or equivalently that (2) holds.
Reaching a point at a distance smaller than $\delta$ from the minimum will need a number of steps of order $\kappa \log(1/\delta)$, with $\kappa = L/\mu$: with the largest allowed step size $\varepsilon = 1/L$, the contraction factor is $(1 - 1/\kappa)^t \approx e^{-t/\kappa}$, which goes below $\delta$ as soon as $t \gtrsim \kappa \log(1/\delta)$.
This quantity $\kappa$ can be interpreted as a « global » condition number of the Hessian of $f$: indeed, the condition number of the Hessian at every point is smaller than $\kappa$. The smaller $\kappa$, the faster the convergence. Elementary preconditioning methods consist in replacing the objective function $f$ by $g : x \mapsto f(Ax)$ for some matrix $A$. The function $g$ stays strongly convex and smooth, but the gradient is $\nabla g(x) = A^\top \nabla f(Ax)$ and the Hessian is $\nabla^2 g(x) = A^\top \nabla^2 f(Ax) A$; a careful choice of $A$ can really reduce the condition number of the problem.
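For a quadratic objective the effect on $\kappa$ can be computed exactly; here is a small NumPy illustration, where I pick the idealized choice $A = H^{-1/2}$ for a known diagonal Hessian $H$ (in practice the Hessian is of course not available in closed form):

```python
import numpy as np

# Ill-conditioned quadratic f(x) = <Hx, x>/2, with condition number 1000.
H = np.diag([1.0, 1000.0])

# Precondition with A = H^{-1/2}: the Hessian of g(x) = f(Ax) is A^T H A = Id.
A = np.diag(np.diag(H) ** -0.5)
H_precond = A.T @ H @ A
print(np.linalg.cond(H), np.linalg.cond(H_precond))  # ≈ 1000.0 and ≈ 1.0
```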
When $f(x) = \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle$ for a symmetric, positive-definite matrix $A$, we are doing nothing more than an iterative method to find a solution of $Ax = b$. The gradient of $f$ is $\nabla f(x) = Ax - b$, while the Hessian is $\nabla^2 f(x) = A$: it no longer depends on $x$. Its eigenvalues are the eigenvalues of $A$, hence $\kappa$ is really the condition number of $A$.
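In this quadratic case the descent reads $x_{t+1} = x_t - \varepsilon(Ax_t - b)$, a fixed-point iteration for the linear system (also known as the Richardson iteration). A minimal sketch, with an $A$ and $b$ I made up:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 0.0])
L = max(np.linalg.eigvalsh(A))           # largest eigenvalue of the Hessian

x = np.zeros(2)
for _ in range(200):
    x = x - (1.0 / L) * (A @ x - b)      # gradient step on f(x) = <Ax,x>/2 - <b,x>
print(x, np.linalg.solve(A, b))          # both print [0.4, -0.2]
```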
There are many situations where one would like to perform gradient descent, but accessing the real gradient at every step can be very resource-consuming. This is the case in many machine learning algorithms, where $f$ is a sum of individual losses over a dataset, and estimating the whole gradient would need computing as many complicated gradients as there are elements in the dataset. In this follow-up note I give the same kind of result, but for SGD, when the loss still satisfies inequalities like (2).
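To fix ideas before that note, here is what the stochastic version looks like on a finite-sum least-squares loss (a generic sketch with made-up data; the step size and iteration count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))             # dataset with n = 1000 examples
y = X @ rng.normal(size=5)                 # targets from a hidden linear model

# f(w) = (1/n) * sum_i (x_i . w - y_i)^2 / 2; SGD uses one random term per step.
w = np.zeros(5)
step = 0.01
for _ in range(10_000):
    i = rng.integers(len(X))
    w -= step * (X[i] @ w - y[i]) * X[i]   # gradient of the i-th loss only
```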