We are now interested in the minimization of a function assuming a very specific shape, namely
$$ f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x), $$
where the $f_i$ are convex functions. This situation arises in many machine learning algorithms where we wish to minimize an empirical loss.
If $i$ is a uniformly chosen index in $\{1, \dots, n\}$, then the random variable $\nabla f_i(x)$ is an unbiased estimator of $\nabla f(x)$. This is the key to stochastic gradient descent algorithms: instead of going in the direction of the full gradient $\nabla f(x)$, which would require evaluating the $n$ gradients $\nabla f_i(x)$ to get $\nabla f(x)$, we go in the direction of the single partial gradient $\nabla f_i(x)$. This kind of method falls within the framework defined long ago by Robbins and Monro, and there is a vast literature on the topic. However, it turns out that convexity-type assumptions on the $f_i$ already yield reasonable convergence results; the results are similar to the classical gradient descent result.
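To make the unbiasedness concrete, here is a small numerical check on a made-up least-squares loss (the setup and names are mine, not from the note): averaging the per-example gradients recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# f(x) = (1/n) sum_i f_i(x) with f_i(x) = 0.5 * (a_i . x - b_i)^2
def grad_f(x):
    return A.T @ (A @ x - b) / n

def grad_fi(x, i):
    return (A[i] @ x - b[i]) * A[i]

x = rng.normal(size=d)
# The uniform average of the grad f_i is exactly grad f:
avg = np.mean([grad_fi(x, i) for i in range(n)], axis=0)
assert np.allclose(avg, grad_f(x))
```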
We recall that $f$ is $\mu$-strongly convex if for all $x, y$,
$$ f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}|y - x|^2, $$
and $L$-smooth if for all $x, y$,
$$ f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}|y - x|^2. $$
In the sequel we will suppose that all the $f_i$ are $\mu$-strongly convex and $L$-smooth, and it is easy to check that in this case, so is $f$. We will denote by $x_\star$ the unique minimizer of $f$. At this point, $\nabla f(x_\star) = 0$.
For all $t \geq 0$, we set $i_t$ to be an index chosen uniformly at random in $\{1, \dots, n\}$, independently of the past. The SGD is given by the updates
$$ x_{t+1} = x_t - \varepsilon \nabla f_{i_t}(x_t). $$
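As a sketch (toy objective of my own choosing, not from the note), the update above is a two-line loop; here $f_i(x) = \frac{1}{2}|x - c_i|^2$, so that $\mu = L = 1$ and the minimizer is the mean of the $c_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
c = rng.normal(size=(n, d))      # f_i(x) = 0.5 * |x - c_i|^2
x_star = c.mean(axis=0)          # unique minimizer of f

eps = 0.05                       # step size (here mu = L = 1)
x = np.zeros(d)
for t in range(2000):
    i = rng.integers(n)          # i_t uniform in {0, ..., n-1}
    x = x - eps * (x - c[i])     # x_{t+1} = x_t - eps * grad f_{i_t}(x_t)

# The iterate hovers near x_star without converging exactly
assert np.linalg.norm(x - x_star) < 1.0
```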
Theorem. If all the $f_i$ are $\mu$-strongly convex and $L$-smooth, and if the step size $\varepsilon$ is smaller than $1/(2L)$, then
$$ \mathbb{E}\left[|x_t - x_\star|^2\right] \leq (1 - \varepsilon\mu)^t\, |x_0 - x_\star|^2 + \frac{2\varepsilon\sigma^2}{\mu}, $$
where $\sigma^2$ is the local variance of the gradients around the minimum:
$$ \sigma^2 = \frac{1}{n}\sum_{i=1}^n |\nabla f_i(x_\star)|^2. $$
What comes out of this theorem, compared with the non-stochastic gradient descent, is the extra term $2\varepsilon\sigma^2/\mu$. Indeed, there can be no general convergence result for SGD since, even when $x_t = x_\star$, the dynamics does not stop because there is no reason for $\nabla f_{i_t}(x_\star)$ to be zero. However, the theorem says that eventually, $x_t$ stays within distance roughly $\sigma\sqrt{2\varepsilon/\mu}$ of the minimizer $x_\star$.
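The noise floor is visible experimentally. With the toy objective $f_i(x) = \frac{1}{2}|x - c_i|^2$ (a hypothetical setup of my own, with $\mu = L = 1$), shrinking the step size shrinks the asymptotic squared error, at the price of slower transients:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 3
c = rng.normal(size=(n, d))      # f_i(x) = 0.5 * |x - c_i|^2, mu = L = 1
x_star = c.mean(axis=0)

def avg_final_sq_error(eps, reps=200, steps=500):
    """Average |x_T - x_star|^2 over independent SGD runs."""
    total = 0.0
    for _ in range(reps):
        x = np.zeros(d)
        for _ in range(steps):
            x = x - eps * (x - c[rng.integers(n)])
        total += np.sum((x - x_star) ** 2)
    return total / reps

# Smaller step size => smaller plateau, as 2*eps*sigma^2/mu predicts
small, big = avg_final_sq_error(0.02), avg_final_sq_error(0.2)
assert small < big
```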
The rest of the note is devoted to the proof of this result, which is marginally more technical than the classical exponential convergence of gradient descent. In the following note, I will prove yet another similar result, but where we considerably loosen the conditions on the $f_i$ and in particular allow for non-convexity.
By developing the square norm $|x_{t+1} - x_\star|^2 = |x_t - \varepsilon \nabla f_{i_t}(x_t) - x_\star|^2$, we get
$$ |x_{t+1} - x_\star|^2 = |x_t - x_\star|^2 - 2\varepsilon \langle \nabla f_{i_t}(x_t), x_t - x_\star \rangle + \varepsilon^2 |\nabla f_{i_t}(x_t)|^2. $$
Let us take expectations conditionally on the past; for simplicity we note $\mathbb{E}_t$ for the conditional expectation given all the information before timestep $t$. We have
$$ \mathbb{E}_t\left[|x_{t+1} - x_\star|^2\right] = |x_t - x_\star|^2 - 2\varepsilon\, \mathbb{E}_t\left[\langle \nabla f_{i_t}(x_t), x_t - x_\star \rangle\right] + \varepsilon^2\, \mathbb{E}_t\left[|\nabla f_{i_t}(x_t)|^2\right]. $$
Since $\nabla f_{i_t}(x_t)$ is an unbiased estimator of $\nabla f(x_t)$, the term in the middle is equal to $-2\varepsilon \langle \nabla f(x_t), x_t - x_\star \rangle$. By strong convexity applied to $x_t$ and $x_\star$ (that's the same argument as for gradient descent),
$$ \langle \nabla f(x_t), x_t - x_\star \rangle \geq f(x_t) - f(x_\star) + \frac{\mu}{2}|x_t - x_\star|^2, $$
and we now have
$$ \mathbb{E}_t\left[|x_{t+1} - x_\star|^2\right] \leq (1 - \varepsilon\mu)|x_t - x_\star|^2 - 2\varepsilon\left(f(x_t) - f(x_\star)\right) + \varepsilon^2\, \mathbb{E}_t\left[|\nabla f_{i_t}(x_t)|^2\right]. $$
The last term $|\nabla f_{i_t}(x_t)|^2$ is not directly accessible due to the inherent randomness of the problem. However, we can bound its conditional expectation $\mathbb{E}_t[|\nabla f_{i_t}(x_t)|^2]$. The key technical lemma will be the following:
$$ \mathbb{E}_t\left[|\nabla f_{i_t}(x_t)|^2\right] \leq 4L\left(f(x_t) - f(x_\star)\right) + 2\sigma^2. $$
Now we can finish the proof of the main theorem: indeed, directly plugging this estimate into the bound above yields
$$ \mathbb{E}_t\left[|x_{t+1} - x_\star|^2\right] \leq (1 - \varepsilon\mu)|x_t - x_\star|^2 - 2\varepsilon(1 - 2\varepsilon L)\left(f(x_t) - f(x_\star)\right) + 2\varepsilon^2\sigma^2, $$
and since we took $\varepsilon < 1/(2L)$, we finally get $1 - 2\varepsilon L > 0$, so the middle term is nonpositive and
$$ \mathbb{E}_t\left[|x_{t+1} - x_\star|^2\right] \leq (1 - \varepsilon\mu)|x_t - x_\star|^2 + 2\varepsilon^2\sigma^2. $$
Taking full expectations and setting $a_t = \mathbb{E}[|x_t - x_\star|^2]$, this is a classical recursion of the form $a_{t+1} \leq \rho a_t + c$ with $\rho = 1 - \varepsilon\mu$ and $c = 2\varepsilon^2\sigma^2$. An easy recursion yields $a_t \leq \rho^t a_0 + c(1 + \rho + \dots + \rho^{t-1}) \leq \rho^t a_0 + c/(1 - \rho)$, hence
$$ \mathbb{E}\left[|x_t - x_\star|^2\right] \leq (1 - \varepsilon\mu)^t |x_0 - x_\star|^2 + \frac{2\varepsilon\sigma^2}{\mu}. $$
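The elementary recursion step can be sanity-checked numerically (toy constants of my own choosing):

```python
# Iterate a_{t+1} = rho * a_t + c and compare against the closed-form
# bound rho^t * a_0 + c / (1 - rho) used in the proof.
rho, c = 0.9, 0.5    # playing the roles of (1 - eps*mu) and 2*eps^2*sigma^2
a0 = 4.0
a = a0
for t in range(1, 100):
    a = rho * a + c
    assert a <= rho ** t * a0 + c / (1 - rho) + 1e-12
```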
Pick any $x$ and a random index $i$ uniform over $\{1, \dots, n\}$. Then $|\nabla f_i(x)|^2 \leq 2|\nabla f_i(x) - \nabla f_i(x_\star)|^2 + 2|\nabla f_i(x_\star)|^2$ by the elementary inequality $|a + b|^2 \leq 2|a|^2 + 2|b|^2$. Using this at $x = x_t$ and then averaging over $i$ we get
$$ \mathbb{E}_t\left[|\nabla f_{i_t}(x_t)|^2\right] \leq \frac{2}{n}\sum_{i=1}^n |\nabla f_i(x_t) - \nabla f_i(x_\star)|^2 + \frac{2}{n}\sum_{i=1}^n |\nabla f_i(x_\star)|^2. $$
Since $\frac{1}{n}\sum_{i=1}^n |\nabla f_i(x_\star)|^2 = \sigma^2$, the second term is $2\sigma^2$, giving the second term in the bound of the lemma. Now, we have to bound the first term. This will come as a consequence of convexity and smoothness: if $g$ is a convex and $L$-smooth function, then for any points $x, y$,
$$ g(x) - g(y) \leq \langle \nabla g(x), x - y \rangle - \frac{1}{2L}|\nabla g(x) - \nabla g(y)|^2. $$
Proof. Indeed, let $x, y$ be any points; introducing an arbitrary third point $z$ we have $g(x) - g(y) = (g(x) - g(z)) + (g(z) - g(y))$.
The first term is smaller than $\langle \nabla g(x), x - z \rangle$ just by mere convexity.
By smoothness, $g(z) - g(y) \leq \langle \nabla g(y), z - y \rangle + \frac{L}{2}|z - y|^2$.
We thus have
$$ g(x) - g(y) \leq \langle \nabla g(x), x - z \rangle + \langle \nabla g(y), z - y \rangle + \frac{L}{2}|z - y|^2, $$
and this upper bound is quadratic in $z$ so it's readily minimized; the minimum is attained at the point
$$ z = y - \frac{1}{L}\left(\nabla g(y) - \nabla g(x)\right), $$
and the corresponding value of the upper bound is exactly $\langle \nabla g(x), x - y \rangle - \frac{1}{2L}|\nabla g(x) - \nabla g(y)|^2$, as claimed.
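As a sanity check of this inequality (a made-up quadratic example, not from the note), take $g(x) = \frac{1}{2}x^\top Q x$ with $Q$ positive semidefinite and $L$ its largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(4, 4))
Q = M @ M.T                       # symmetric PSD => g is convex
L = np.linalg.eigvalsh(Q)[-1]     # smoothness constant: top eigenvalue

def g(x):
    return 0.5 * x @ Q @ x

def grad(x):
    return Q @ x

# g(x) - g(y) <= <grad g(x), x - y> - |grad g(x) - grad g(y)|^2 / (2L)
for _ in range(100):
    x, y = rng.normal(size=4), rng.normal(size=4)
    lhs = g(x) - g(y)
    rhs = grad(x) @ (x - y) - np.sum((grad(x) - grad(y)) ** 2) / (2 * L)
    assert lhs <= rhs + 1e-9
```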
Apply the lemma to $g = f_i$ with $x = x_\star$ and $y = x_t$, then rearrange:
$$ |\nabla f_i(x_t) - \nabla f_i(x_\star)|^2 \leq 2L\left(f_i(x_t) - f_i(x_\star) - \langle \nabla f_i(x_\star), x_t - x_\star \rangle\right). $$
We now average over the random index $i$; the average of the upper bound is
$$ \frac{2L}{n}\sum_{i=1}^n \left(f_i(x_t) - f_i(x_\star) - \langle \nabla f_i(x_\star), x_t - x_\star \rangle\right) = 2L\left(f(x_t) - f(x_\star)\right), $$
since $\nabla f(x_\star) = 0$. The first term is thus smaller than $4L(f(x_t) - f(x_\star))$, and plugging this in the averaged bound above gives the lemma.
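The lemma itself can also be checked numerically; the following sketch uses a hypothetical least-squares instance, with the common smoothness constant $L$ taken as the largest of the per-example constants:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 4
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# f_i(x) = 0.5 * (a_i . x - b_i)^2 is convex and |a_i|^2-smooth
L = np.sum(A ** 2, axis=1).max()      # common smoothness constant
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def all_grads(x):                     # row i is grad f_i(x)
    return (A @ x - b)[:, None] * A

sigma2 = np.mean(np.sum(all_grads(x_star) ** 2, axis=1))

# E_i |grad f_i(x)|^2 <= 4L (f(x) - f(x_star)) + 2 sigma^2
for _ in range(50):
    x = x_star + rng.normal(size=d)
    lhs = np.mean(np.sum(all_grads(x) ** 2, axis=1))
    rhs = 4 * L * (f(x) - f(x_star)) + 2 * sigma2
    assert lhs <= rhs + 1e-9
```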
This proof was shamelessly extracted from the magnificent survey on gradient descent algorithms by Gower and Garrigos.