Gradient descent II: stochastic gradient descent for convex functions

October 2023

We are now interested in the minimization of a function $f$ with a very specific form, namely

$$ f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x) $$

where the $f_i : \mathbb{R}^d \to \mathbb{R}$ are convex functions. This situation arises in many machine learning problems where we wish to minimize an empirical loss, each $f_i$ being the loss incurred on the $i$-th data point.

If $i$ is a uniformly chosen index in $\{1, \dotsc, N\}$, then the random variable $f_i(x)$ is an unbiased estimator of $f(x)$, and likewise $\nabla f_i(x)$ is an unbiased estimator of $\nabla f(x)$. This is the key to stochastic gradient descent (SGD) algorithms: instead of moving in the direction of the full gradient $\nabla f(x)$, which would require evaluating all $N$ gradients $\nabla f_i(x)$, we move in the direction of a single partial gradient $\nabla f_i(x)$. This kind of method falls within the framework defined long ago by Robbins and Monro, and there is a vast literature on the topic. However, it turns out that convexity-type assumptions on the $f_i$ already yield reasonable convergence results, similar to the classical gradient descent result.
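
To make this concrete, here is a minimal NumPy sketch of such a finite-sum objective; the ridge-regularized least-squares loss and all names in it (`grad_fi`, `grad_f`, `A`, `b`) are my own illustrative choices, not part of the note. It checks empirically that a uniformly sampled partial gradient is an unbiased estimator of the full gradient.

```python
import numpy as np

# Toy finite-sum objective (illustrative choice): ridge-regularized least squares,
# f_i(x) = 0.5 * (<a_i, x> - b_i)^2 + 0.5 * eta * |x|^2.
# Each f_i is eta-strongly convex and (|a_i|^2 + eta)-smooth.
rng = np.random.default_rng(0)
N, d, eta = 50, 5, 0.1
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

def grad_fi(x, i):
    """Gradient of the i-th summand f_i at x."""
    return (A[i] @ x - b[i]) * A[i] + eta * x

def grad_f(x):
    """Full gradient: the average of the N partial gradients."""
    return np.mean([grad_fi(x, i) for i in range(N)], axis=0)

# A uniformly sampled partial gradient is an unbiased estimator of grad_f(x):
x = rng.normal(size=d)
samples = np.array([grad_fi(x, rng.integers(N)) for _ in range(100_000)])
print(np.abs(samples.mean(axis=0) - grad_f(x)).max())  # small: Monte Carlo error only
```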

We recall that $f$ is $\eta$-strongly convex and $M$-smooth if, for all $x$ and $y$,

$$ \langle \nabla f(x), y-x \rangle + \frac{\eta}{2}|x-y|^2 \leqslant f(y) - f(x) \leqslant \langle \nabla f(x), y-x \rangle + \frac{M}{2}|x-y|^2. $$

In the sequel we will suppose that all the $f_i$ are $\eta$-strongly convex and $M$-smooth; it is easy to check that in this case, so is $f$. We will denote by $x$ the unique minimizer of $f$. At this point, $\nabla f(x) = 0$.

For all $n$, let $i_n \sim \mathrm{Unif}\{1, \dotsc, N\}$ be drawn independently of everything else, and fix a constant step size $\varepsilon > 0$. The SGD iterates are given by the updates

$$ x_{n+1} = x_n - \varepsilon \nabla f_{i_n}(x_n). $$
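
In code, the update rule is a one-liner. Below is a minimal generic sketch; the helper name `sgd` and its signature are mine, the only assumption being a callable `grad_fi(x, i)` returning $\nabla f_i(x)$ (such as the one from the sketch above).

```python
import numpy as np

def sgd(grad_fi, x0, eps, n_steps, N, seed=None):
    """Constant-step SGD: x_{n+1} = x_n - eps * grad_fi(x_n, i_n), i_n ~ Unif{1,...,N}."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        i = rng.integers(N)            # uniform random index i_n
        x = x - eps * grad_fi(x, i)    # step along a single partial gradient
    return x
```

On the toy ridge problem above, `sgd(grad_fi, np.zeros(d), eps=0.01, n_steps=20_000, N=N)` should end up in a small neighbourhood of the minimizer, which is exactly what the theorem below quantifies.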

Theorem. If all the $f_i$ are $\eta$-strongly convex and $M$-smooth, and if the step size $\varepsilon$ is smaller than $1/(2M)$, then

$$ \mathbb{E}[|x_n - x|^2] \leqslant (1 - \varepsilon \eta)^n |x_0 - x|^2 + \frac{2\varepsilon}{\eta}\sigma $$

where $\sigma$ is the local variance of the partial gradients at the minimum:

$$ \sigma = \mathbb{E}_{j \sim \mathrm{Unif}\{1, \dotsc, N\}}[|\nabla f_j(x)|^2]. $$

What comes out of this theorem, compared with the non-stochastic gradient descent, is the $2\varepsilon\sigma / \eta$ term. Indeed, there can be no general convergence result for constant-step SGD since, even when $x_n = x$, the dynamics does not stop: there is no reason for $\nabla f_i(x)$ to be zero. However, the theorem says that eventually, $x_n$ stays within squared distance roughly $2\varepsilon\sigma / \eta$ of the minimizer $x$, in expectation.
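
This noise floor is easy to observe numerically. The following self-contained sketch uses a scalar toy problem of my own choosing, $f_i(x) = \frac{\eta}{2}(x - c_i)^2$: each $f_i$ is $\eta$-strongly convex and $\eta$-smooth (so $M = \eta$), the minimizer is the mean of the $c_i$, and $\sigma = \eta^2 \, \mathrm{Var}(c)$.

```python
import numpy as np

# Scalar toy problem (illustrative choice): f_i(x) = 0.5 * eta * (x - c_i)^2,
# so M = eta, x_star = mean(c), and sigma = E|f_j'(x_star)|^2 = eta^2 * Var(c).
rng = np.random.default_rng(1)
N, eta, eps = 200, 1.0, 0.1                 # eps < 1/(2M) = 0.5
c = rng.normal(size=N)
x_star = c.mean()
sigma = np.mean((eta * (x_star - c)) ** 2)

def run_sgd(n_steps):
    x = 5.0                                  # x_0, far from the minimizer
    for _ in range(n_steps):
        i = rng.integers(N)
        x -= eps * eta * (x - c[i])          # partial-gradient step
    return x

# Estimate E|x_n - x|^2 over many independent runs; it settles at a positive
# level, below the theoretical ceiling 2*eps*sigma/eta of the theorem.
runs = np.array([run_sgd(500) for _ in range(2000)])
print(np.mean((runs - x_star) ** 2), 2 * eps * sigma / eta)
```

Halving $\varepsilon$ roughly halves the floor: smaller steps mean a smaller asymptotic error but a slower contraction of the $(1 - \varepsilon\eta)^n$ term, which is exactly the trade-off the theorem quantifies.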

The rest of the note is devoted to the proof of this result, which is only marginally more technical than the classical exponential convergence of gradient descent. In the following note, I will prove yet another similar result, but under considerably weaker conditions on $f$, allowing in particular for non-convexity.

Proof

By expanding the squared norm $|x_{n+1} - x|^2 = |x_{n+1} - x_n + x_n - x|^2$ and using the update rule, we get

$$ |x_{n+1} - x|^2 = |x_n - x|^2 - 2\varepsilon\langle x_n - x, \nabla f_{i_n}(x_n)\rangle + \varepsilon^2 | \nabla f_{i_n}(x_n)|^2. $$

Let us take expectations conditionally on $x_n$; for simplicity we write $\mathbb{E}_n$ for the conditional expectation given all the information up to time $n$, that is, given $i_0, \dotsc, i_{n-1}$. We have

$$ \mathbb{E}_n[|x_{n+1} - x|^2 ] = |x_n - x|^2 - 2\varepsilon\langle x_n - x, \mathbb{E}_n[\nabla f_{i_n}(x_n)]\rangle + \varepsilon^2 \mathbb{E}_n[| \nabla f_{i_n}(x_n)|^2]. $$

Since $i_n$ is independent of $x_n$, the partial gradient $\nabla f_{i_n}(x_n)$ is, conditionally on $x_n$, an unbiased estimator of $\nabla f(x_n)$: indeed, $\mathbb{E}_n[\nabla f_{i_n}(x_n)] = \frac{1}{N}\sum_{i=1}^N \nabla f_i(x_n) = \nabla f(x_n)$. The middle term is therefore equal to $-2\varepsilon\langle x_n - x, \nabla f(x_n)\rangle$. By the strong convexity inequality applied to $x = x_n$ and $y = x$ (the same argument as for plain gradient descent),

$$ -2\varepsilon \langle \nabla f(x_n), x_n - x \rangle \leqslant 2\varepsilon(f(x) - f(x_n)) - \varepsilon\eta|x_n - x|^2 $$

and we now have

$$ \mathbb{E}_n[|x_{n+1} - x|^2 ] \leqslant (1 - \varepsilon \eta) |x_n - x|^2 + z_n $$

where

$$ z_n = - 2\varepsilon (f(x_n) - f(x)) + \varepsilon^2 \mathbb{E}_n[|\nabla f_{i_n}(x_n)|^2]. $$

Because of the randomness of the index $i_n$, the term $z_n$ is not easy to bound directly; however, we can bound $\mathbb{E}[z_n]$. The key technical lemma is the following.

Lemma. For $j$ uniform over $\{1, \dotsc, N\}$ and any $y \in \mathbb{R}^d$,

$$ \mathbb{E}[|\nabla f_j(y)|^2 ] \leqslant 4 M (f(y) - f(x)) + 2\sigma . $$

Now we can finish the proof of the main theorem: applying this estimate with $y = x_n$ (conditionally on $x_n$, the index $i_n$ is indeed uniform) inside the definition of $z_n$ and taking expectations yields

$$ \mathbb{E}[z_n] \leqslant 2 \varepsilon^2 \sigma + \mathbb{E}[f(x_n) - f(x)]\times \left(4\varepsilon^2 M - 2\varepsilon \right) $$

and since we took $\varepsilon < 1/(2M)$, the second term is nonpositive, so that $\mathbb{E}[z_n] \leqslant 2\varepsilon^2 \sigma$ and

$$ \mathbb{E}[|x_{n+1} - x|^2] \leqslant (1 - \varepsilon \eta)\mathbb{E}[|x_n - x|^2] + 2\varepsilon^2 \sigma. $$

This is a classical recursion of the form $u_{n+1} \leqslant a u_n + b$ with $a = 1 - \varepsilon\eta$ and $b = 2\varepsilon^2\sigma$. An easy induction yields $u_n \leqslant b(1-a^n)/(1-a) + a^n u_0 \leqslant b/(1-a) + a^n u_0$, and here $b/(1-a) = 2\varepsilon^2\sigma/(\varepsilon\eta) = 2\varepsilon\sigma/\eta$, hence

$$ \mathbb{E}[|x_n - x|^2] \leqslant \frac{2\varepsilon}{\eta}\sigma + (1 - \eta \varepsilon)^n \mathbb{E}[|x_0 - x|^2] . $$

Proof of the Lemma

Pick any $y$ and a random index $j$ uniform over $\{1, \dotsc, N\}$. Then $|\nabla f_j(y)| \leqslant |\nabla f_j (y) - \nabla f_j(x)| + |\nabla f_j(x)|$. Using $(a+b)^2 \leqslant 2a^2 + 2b^2$ and then averaging over $j$, we get

$$ \mathbb{E}|\nabla f_j(y)|^2 \leqslant 2\mathbb{E}|\nabla f_j (y) - \nabla f_j(x)|^2 + 2\mathbb{E}|\nabla f_j(x)|^2. $$

Since $\mathbb{E}[\nabla f_j(x)] = \nabla f(x) = 0$, the second term is exactly $2\sigma$ (which is why $\sigma$ really is a variance); this accounts for the second term in the lemma. It remains to bound the first term, which will come as a consequence of convexity and smoothness.

Lemma. Any convex $M$-smooth function $g$ satisfies, for all $x$ and $y$,

$$ g(x) - g(y) \leqslant \langle \nabla g(x), x-y\rangle - \frac{1}{2M}|\nabla g(y) - \nabla g(x)|^2. $$

Proof. Indeed, let $z$ be any point; introducing $g(z)$, we have $g(x) - g(y) = g(x) - g(z) + g(z) - g(y)$.

  • The first term is smaller than $\langle \nabla g(x), x-z\rangle$ by convexity alone.

  • By smoothness, $g(z) - g(y) \leqslant \langle \nabla g(y), z-y\rangle + M|z-y|^2/2$.

We thus have

$$ g(x) - g(y) \leqslant \langle \nabla g(x), x-z\rangle + \langle \nabla g(y), z-y\rangle + M|z-y|^2/2 $$

and this upper bound is quadratic in $z$, so it is readily minimized; the minimum is attained at the point

$$ z = y - \frac{\nabla g(y) - \nabla g(x)}{M} $$

and plugging this value of $z$ back into the upper bound gives exactly the right-hand side of the lemma.

We now apply this lemma to $g = f_j$ and rearrange:

$$ |\nabla f_j(y) - \nabla f_j(x)|^2 \leqslant 2M(f_j(y) - f_j(x)) + 2M \langle \nabla f_j(x), x-y\rangle. $$

We now average over the random index $j$; the average of the upper bound is

$$ 2M (f(y) - f(x)) + 2M \langle \nabla f(x), x-y\rangle = 2M(f(y) - f(x)) $$

since $\nabla f(x) = 0$. Plugging this into the bound on $\mathbb{E}|\nabla f_j(y)|^2$ above proves the lemma.

References

This proof was shamelessly extracted from the magnificent survey on gradient descent algorithms by Gower and Garrigos.