A function f is convex when f(x)−f(y)⩾⟨∇f(y),x−y⟩; with this pretty loose definition, it can very well happen that two different points x,y have the same value f(x)=f(y). Finding a minimum of f with gradient descent can then lead to very different behaviours. For smooth functions, we have more quantitative notions of convexity and they lead to convergence results for the gradient descent algorithm.
The largest eigenvalues of the Hessian of f determine the Lipschitz smoothness of f, while the smallest ones determine the strong convexity of f. More precisely, if λmin(x),λmax(x) are the smallest and largest eigenvalues of ∇2f(x) (the Hessian of f at point x, a square matrix), then by elementary arguments we have
λmin(x)∣v∣2⩽⟨v,∇2f(x)v⟩⩽λmax(x)∣v∣2.
But then, if we fix two vectors x,y, using Taylor's formulas yields the following result.
Let
f be a
C2, convex function. If there are two numbers
0<η⩽M such that for every
x, we have
η⩽λmin(x) and
λmax(x)⩽M, then for every
x,y,
⟨∇f(x),y−x⟩+2η∣x−y∣2⩽f(y)−f(x)⩽⟨∇f(x),y−x⟩+2M∣x−y∣2.
The lower bound is a stronger version of convexity; it is often called η-strong convexity. The second one is indeed nothing more than the fact that if the Hessian norm is bounded by M, then the gradient of f is M-Lipschitz. We say that f is M-smooth.
Strongly convex functions have a unique minimum, say x. The gradient descent algorithm for finding this minimum consists in following the steepest descent direction, given by the gradient of f, with a step size of ε: starting from x0, this descent is written
xn+1=xn−ε∇f(xn).
Theorem.
If f is η-strongly convex and M-smooth, and if the step size ε is smaller than 1/M, then for every n we have
∣xn−x∣2⩽(1−εη)n∣x0−x∣2.
Proof. Developing the euclidean norm and using the LHS of (2) with y=xn, we get
∣xn+1−x∣2=∣xn−ε∇f(xn)−x∣2=∣xn−x∣2−2ε⟨∇f(xn),xn−x⟩+ε2∣∇f(xn)∣2⩽∣xn−x∣2+2ε(f(x)−f(xn))−ηε∣x−xn∣2+ε2∣∇f(xn)∣2⩽∣xn−x∣2(1−ηε)+z
where z=ε2∣∇f(xn)∣2−2ε(f(xn)−f(x)). Now we only have to check that z is nonpositive; if so, we will obtain ∣xn+1−x∣2⩽∣xn−x∣2(1−ηε) and a recursion will finish the proof.
Proof of z⩽0. By the RHS of (2) and the definition of xn+1,
f(xn+1)−f(xn)⩽−ε∣∇f(xn)∣2+2ε2M∣∇f(xn)∣2=ε∣∇f(xn)∣2(εM/2−1)
but then, if εM ⩽1, the RHS is smaller than −ε/2∣∇f(xn)∣2. Overall, we get that
ε∣∇f(xn)∣2⩽2(f(xn)−f(xn+1))⩽2(f(xn)−f(x))
or equivalently z⩽0.
There are several takeaway messages in this elementary but powerful result.
Having a step size ε smaller than the Lipschitz constant M is mandatory to get something meaningful. Otherwise, if the step size is too large, we could take steps that are consistently too big and overshoot the minimum at every step.
We don't need f to be smooth, only that f is η-strongly convex and M-smooth, or equivalently that (2) holds.
Reaching a point at a distance smaller than δ from the minimum x will need a number of steps of order ≈κln(1/δ) with κ=M/η.
This quantity κ can be interpreted as a « global » conditionning number of the Hessian of f: indeed, the conditionning of the Hessian at every point is smaller than κ. The smaller κ, the faster the convergence. Elementary preconditioning methods consist in replacing the objective function f by x↦f(Px) for some matrix P. The function stays strongly convex and smooth, but the gradient is P∇f(Px) and the Hessian is P∇2f(Px)P∗; a careful choice of P will really reduce the conditionning of the problem.
When f(x)=∣Ax−b∣2/2, we are doing nothing more than an iterative method to find a solution for Ax=b. The gradient of f is Ax, while the Hessian is A: it no longer depends on x. Its eigenvalues are the eigenvalues of A, hence κ is really the conditioning number of A.