We are now interested in the minimization of a function assuming a very specific shape, namely
$$ f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x), $$
where the $f_i$ are convex functions. This situation arises in many machine learning algorithms where we wish to minimize an empirical loss.
If $i$ is a uniformly chosen index in $\{1, \dots, n\}$, then the random variable $\nabla f_i(x)$ is an unbiased estimator of $\nabla f(x)$. This is the key to stochastic gradient descent algorithms: instead of going in the direction of the full gradient $\nabla f(x)$, which would require evaluating the $n$ gradients $\nabla f_i(x)$ to get $\nabla f(x)$, we go in the direction of the single partial gradient $\nabla f_i(x)$. This kind of method falls within the framework defined long ago by Robbins and Monro, and there is a vast literature on the topic. However, it turns out that convexity-type assumptions on the $f_i$ already yield reasonable convergence results; the results are similar to the classical gradient descent result.
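To make the unbiasedness concrete, here is a small numerical check on a made-up least-squares loss (the setup and names are mine, not from the note): averaging the per-example gradients recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# f(x) = (1/n) sum_i f_i(x) with f_i(x) = 0.5 * (a_i . x - b_i)^2
def grad_f(x):
    return A.T @ (A @ x - b) / n

def grad_fi(x, i):
    return (A[i] @ x - b[i]) * A[i]

x = rng.normal(size=d)
# The uniform average of the grad f_i is exactly grad f:
avg = np.mean([grad_fi(x, i) for i in range(n)], axis=0)
assert np.allclose(avg, grad_f(x))
```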
We recall that $f$ is $\mu$-strongly convex if for all $x, y$,
$$ f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}|y - x|^2, $$
and $L$-smooth if for all $x, y$,
$$ f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}|y - x|^2. $$
In the sequel we will suppose that all the $f_i$ are $\mu$-strongly convex and $L$-smooth, and it is easy to check that in this case, so is $f$. We will denote by $x_\star$ the unique minimizer of $f$. At this point, $\nabla f(x_\star) = 0$.
For all $t \geq 0$, we set $i_t$ to be an index chosen uniformly at random in $\{1, \dots, n\}$, independently of the past. The SGD is given by the updates
$$ x_{t+1} = x_t - \varepsilon \nabla f_{i_t}(x_t). $$
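As a sketch (toy objective of my own choosing, not from the note), the update above is a two-line loop; here $f_i(x) = \frac{1}{2}|x - c_i|^2$, so that $\mu = L = 1$ and the minimizer is the mean of the $c_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
c = rng.normal(size=(n, d))      # f_i(x) = 0.5 * |x - c_i|^2
x_star = c.mean(axis=0)          # unique minimizer of f

eps = 0.05                       # step size (here mu = L = 1)
x = np.zeros(d)
for t in range(2000):
    i = rng.integers(n)          # i_t uniform in {0, ..., n-1}
    x = x - eps * (x - c[i])     # x_{t+1} = x_t - eps * grad f_{i_t}(x_t)

# The iterate hovers near x_star without converging exactly
assert np.linalg.norm(x - x_star) < 1.0
```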
Theorem. If all the $f_i$ are $\mu$-strongly convex and $L$-smooth, and if the step size $\varepsilon$ is smaller than $1/(2L)$, then
$$ \mathbb{E}\left[|x_t - x_\star|^2\right] \leq (1 - \varepsilon\mu)^t\, |x_0 - x_\star|^2 + \frac{2\varepsilon\sigma^2}{\mu}, $$
where $\sigma^2$ is the local variance of the gradients around the minimum:
$$ \sigma^2 = \frac{1}{n}\sum_{i=1}^n |\nabla f_i(x_\star)|^2. $$
What comes out of this theorem, compared with the non-stochastic gradient descent, is the extra term $2\varepsilon\sigma^2/\mu$. Indeed, there can be no general convergence result for SGD since, even when $x_t = x_\star$, the dynamics does not stop because there is no reason for $\nabla f_{i_t}(x_\star)$ to be zero. However, the theorem says that eventually, $x_t$ stays within distance roughly $\sigma\sqrt{2\varepsilon/\mu}$ of the minimizer $x_\star$.
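The noise floor is visible experimentally. With the toy objective $f_i(x) = \frac{1}{2}|x - c_i|^2$ (a hypothetical setup of my own, with $\mu = L = 1$), shrinking the step size shrinks the asymptotic squared error, at the price of slower transients:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 3
c = rng.normal(size=(n, d))      # f_i(x) = 0.5 * |x - c_i|^2, mu = L = 1
x_star = c.mean(axis=0)

def avg_final_sq_error(eps, reps=200, steps=500):
    """Average |x_T - x_star|^2 over independent SGD runs."""
    total = 0.0
    for _ in range(reps):
        x = np.zeros(d)
        for _ in range(steps):
            x = x - eps * (x - c[rng.integers(n)])
        total += np.sum((x - x_star) ** 2)
    return total / reps

# Smaller step size => smaller plateau, as 2*eps*sigma^2/mu predicts
small, big = avg_final_sq_error(0.02), avg_final_sq_error(0.2)
assert small < big
```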
The rest of the note is devoted to the proof of this result, which is marginally more technical than the classical exponential convergence of gradient descent. In the following note, I will prove yet another similar result, but where we considerably loosen the conditions on the $f_i$ and in particular allow for non-convexity.
By developing the square norm $|x_{t+1} - x_\star|^2 = |x_t - \varepsilon \nabla f_{i_t}(x_t) - x_\star|^2$, we get
$$ |x_{t+1} - x_\star|^2 = |x_t - x_\star|^2 - 2\varepsilon \langle \nabla f_{i_t}(x_t), x_t - x_\star \rangle + \varepsilon^2 |\nabla f_{i_t}(x_t)|^2. $$
Let us take expectations conditionally on the past; for simplicity we note $\mathbb{E}_t$ for the conditional expectation given all the information before timestep $t$. We have
$$ \mathbb{E}_t\left[|x_{t+1} - x_\star|^2\right] = |x_t - x_\star|^2 - 2\varepsilon\, \mathbb{E}_t\left[\langle \nabla f_{i_t}(x_t), x_t - x_\star \rangle\right] + \varepsilon^2\, \mathbb{E}_t\left[|\nabla f_{i_t}(x_t)|^2\right]. $$
Since $\nabla f_{i_t}(x_t)$ is an unbiased estimator of $\nabla f(x_t)$, the term in the middle is equal to $-2\varepsilon \langle \nabla f(x_t), x_t - x_\star \rangle$. By strong convexity applied to $x_t$ and $x_\star$ (that's the same argument as for gradient descent),
$$ \langle \nabla f(x_t), x_t - x_\star \rangle \geq f(x_t) - f(x_\star) + \frac{\mu}{2}|x_t - x_\star|^2, $$
and we now have
$$ \mathbb{E}_t\left[|x_{t+1} - x_\star|^2\right] \leq (1 - \varepsilon\mu)|x_t - x_\star|^2 - 2\varepsilon\left(f(x_t) - f(x_\star)\right) + \varepsilon^2\, \mathbb{E}_t\left[|\nabla f_{i_t}(x_t)|^2\right]. $$
The last term $|\nabla f_{i_t}(x_t)|^2$ is not directly accessible due to the inherent randomness of the problem. However, we can bound its conditional expectation $\mathbb{E}_t[|\nabla f_{i_t}(x_t)|^2]$. The key technical lemma will be the following:
$$ \mathbb{E}_t\left[|\nabla f_{i_t}(x_t)|^2\right] \leq 4L\left(f(x_t) - f(x_\star)\right) + 2\sigma^2. $$
Now we can finish the proof of the main theorem: indeed, directly plugging this estimate into the bound above yields
$$ \mathbb{E}_t\left[|x_{t+1} - x_\star|^2\right] \leq (1 - \varepsilon\mu)|x_t - x_\star|^2 - 2\varepsilon(1 - 2\varepsilon L)\left(f(x_t) - f(x_\star)\right) + 2\varepsilon^2\sigma^2, $$
and since we took $\varepsilon < 1/(2L)$, we finally get $1 - 2\varepsilon L > 0$, so the middle term is nonpositive and
$$ \mathbb{E}_t\left[|x_{t+1} - x_\star|^2\right] \leq (1 - \varepsilon\mu)|x_t - x_\star|^2 + 2\varepsilon^2\sigma^2. $$
Taking full expectations and setting $a_t = \mathbb{E}[|x_t - x_\star|^2]$, this is a classical recursion of the form $a_{t+1} \leq \rho a_t + c$ with $\rho = 1 - \varepsilon\mu$ and $c = 2\varepsilon^2\sigma^2$. An easy recursion yields $a_t \leq \rho^t a_0 + c(1 + \rho + \dots + \rho^{t-1}) \leq \rho^t a_0 + c/(1 - \rho)$, hence
$$ \mathbb{E}\left[|x_t - x_\star|^2\right] \leq (1 - \varepsilon\mu)^t |x_0 - x_\star|^2 + \frac{2\varepsilon\sigma^2}{\mu}. $$
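The elementary recursion step can be sanity-checked numerically (toy constants of my own choosing):

```python
# Iterate a_{t+1} = rho * a_t + c and compare against the closed-form
# bound rho^t * a_0 + c / (1 - rho) used in the proof.
rho, c = 0.9, 0.5    # playing the roles of (1 - eps*mu) and 2*eps^2*sigma^2
a0 = 4.0
a = a0
for t in range(1, 100):
    a = rho * a + c
    assert a <= rho ** t * a0 + c / (1 - rho) + 1e-12
```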
Pick any $x$ and a random index $i$ uniform over $\{1, \dots, n\}$. Then $|\nabla f_i(x)|^2 \leq 2|\nabla f_i(x) - \nabla f_i(x_\star)|^2 + 2|\nabla f_i(x_\star)|^2$ by the elementary inequality $|a + b|^2 \leq 2|a|^2 + 2|b|^2$. Using this at $x = x_t$ and then averaging over $i$ we get
$$ \mathbb{E}_t\left[|\nabla f_{i_t}(x_t)|^2\right] \leq \frac{2}{n}\sum_{i=1}^n |\nabla f_i(x_t) - \nabla f_i(x_\star)|^2 + \frac{2}{n}\sum_{i=1}^n |\nabla f_i(x_\star)|^2. $$
Since $\frac{1}{n}\sum_{i=1}^n |\nabla f_i(x_\star)|^2 = \sigma^2$, the second term is $2\sigma^2$, giving the second term in the bound of the lemma. Now, we have to bound the first term. This will come as a consequence of convexity and smoothness: if $g$ is a convex and $L$-smooth function, then for any points $x, y$,
$$ g(x) - g(y) \leq \langle \nabla g(x), x - y \rangle - \frac{1}{2L}|\nabla g(x) - \nabla g(y)|^2. $$
Proof. Indeed, let $x, y$ be any points; introducing an arbitrary third point $z$ we have $g(x) - g(y) = (g(x) - g(z)) + (g(z) - g(y))$.
The first term is smaller than $\langle \nabla g(x), x - z \rangle$ just by mere convexity.
By smoothness, $g(z) - g(y) \leq \langle \nabla g(y), z - y \rangle + \frac{L}{2}|z - y|^2$.
We thus have
$$ g(x) - g(y) \leq \langle \nabla g(x), x - z \rangle + \langle \nabla g(y), z - y \rangle + \frac{L}{2}|z - y|^2, $$
and this upper bound is quadratic in $z$ so it's readily minimized; the minimum is attained at the point
$$ z = y - \frac{1}{L}\left(\nabla g(y) - \nabla g(x)\right), $$
and the corresponding value of the upper bound is exactly $\langle \nabla g(x), x - y \rangle - \frac{1}{2L}|\nabla g(x) - \nabla g(y)|^2$, as claimed.
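As a sanity check of this inequality (a made-up quadratic example, not from the note), take $g(x) = \frac{1}{2}x^\top Q x$ with $Q$ positive semidefinite and $L$ its largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(4, 4))
Q = M @ M.T                       # symmetric PSD => g is convex
L = np.linalg.eigvalsh(Q)[-1]     # smoothness constant: top eigenvalue

def g(x):
    return 0.5 * x @ Q @ x

def grad(x):
    return Q @ x

# g(x) - g(y) <= <grad g(x), x - y> - |grad g(x) - grad g(y)|^2 / (2L)
for _ in range(100):
    x, y = rng.normal(size=4), rng.normal(size=4)
    lhs = g(x) - g(y)
    rhs = grad(x) @ (x - y) - np.sum((grad(x) - grad(y)) ** 2) / (2 * L)
    assert lhs <= rhs + 1e-9
```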
Apply the lemma to $g = f_i$ with $x = x_\star$ and $y = x_t$, then rearrange:
$$ |\nabla f_i(x_t) - \nabla f_i(x_\star)|^2 \leq 2L\left(f_i(x_t) - f_i(x_\star) - \langle \nabla f_i(x_\star), x_t - x_\star \rangle\right). $$
We now average over the random index $i$; the average of the upper bound is
$$ \frac{2L}{n}\sum_{i=1}^n \left(f_i(x_t) - f_i(x_\star) - \langle \nabla f_i(x_\star), x_t - x_\star \rangle\right) = 2L\left(f(x_t) - f(x_\star)\right), $$
since $\nabla f(x_\star) = 0$. The first term is thus smaller than $4L(f(x_t) - f(x_\star))$, and plugging this in the averaged bound above gives the lemma.
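The lemma itself can also be checked numerically; the following sketch uses a hypothetical least-squares instance, with the common smoothness constant $L$ taken as the largest of the per-example constants:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 4
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# f_i(x) = 0.5 * (a_i . x - b_i)^2 is convex and |a_i|^2-smooth
L = np.sum(A ** 2, axis=1).max()      # common smoothness constant
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def all_grads(x):                     # row i is grad f_i(x)
    return (A @ x - b)[:, None] * A

sigma2 = np.mean(np.sum(all_grads(x_star) ** 2, axis=1))

# E_i |grad f_i(x)|^2 <= 4L (f(x) - f(x_star)) + 2 sigma^2
for _ in range(50):
    x = x_star + rng.normal(size=d)
    lhs = np.mean(np.sum(all_grads(x) ** 2, axis=1))
    rhs = 4 * L * (f(x) - f(x_star)) + 2 * sigma2
    assert lhs <= rhs + 1e-9
```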
This proof was shamelessly extracted from the magnificent survey on gradient descent algorithms by Gower and Garrigos.