Gradient descent III: SGD for Polyak-Łojasiewicz functions

October 2023

In the preceding note on stochastic gradient descent we proved some convergence guarantees for functions satisfying strong convexity assumptions. Now, we move on to a more realistic setting. A differentiable function $f:\mathbb{R}^d \to \mathbb{R}$ is $\eta$-Polyak-Łojasiewicz[1] ($\eta$-PL for short) if it is bounded below and if for every $x$,

$$ f(x) - \inf f \leqslant \frac{1}{2\eta}|\nabla f(x)|^2. \tag{1} $$
Fact. $\eta$-strongly convex functions are $\eta$-PL.

Proof. If $x$ is a minimizer of $f$ then for any $y$,

$$ f(x) - f(y) \geqslant \langle \nabla f(y), x-y\rangle + (\eta/2)|y-x|^2 = |\nabla f(y)/\sqrt{\eta} - \sqrt{\eta}(y-x)|^2 / 2 - |\nabla f(y)|^2 / (2\eta). $$

Flipping sides and using $f(x) = \inf f$ and $|\nabla f(y)/\sqrt{\eta} - \sqrt{\eta}(y-x)|^2 \geqslant 0$ yields (1).
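As a quick sanity check, take the quadratic $f(x) = \frac{\eta}{2}|x|^2$, which is $\eta$-strongly convex: then $\nabla f(x) = \eta x$ and $\frac{1}{2\eta}|\nabla f(x)|^2 = \frac{\eta}{2}|x|^2 = f(x) - \inf f$, so (1) holds with equality and the constant in the Fact cannot be improved in general.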

What is really impressive is that most results on gradient descent or SGD transfer directly from strongly convex functions to PL functions... but PL functions need not even be convex! This is a great leap forward for optimization, and it's quite a recent finding.

Two important examples of PL functions which are not convex are the one-dimensional function $x \mapsto x^2 + 3\sin^2(x)$, and the loss of sufficiently over-parametrized neural networks, as studied in the Liu–Zhu–Belkin paper from which the Figure below is taken.

These examples show that PL functions form a much wider class than (strongly) convex functions. On the other hand, they need not have a unique minimizer, which is the case for strongly convex functions. Indeed, (1) implies that any critical point ($\nabla f (x) = 0$) satisfies $f(x) = \inf f$. In other words, local minima are not unique at all, but they are all global; see (b) in the Figure below[2].
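To get a feel for the definition, here is a minimal numerical sketch (the grid and its bounds are chosen arbitrarily) which estimates the best PL constant of the non-convex function $x \mapsto x^2 + 3\sin^2(x)$ by evaluating the ratio in (1) on a grid:

```python
import numpy as np

# f(x) = x^2 + 3 sin(x)^2: non-convex, bounded below, with a unique critical
# point at x = 0 where it attains its minimum inf f = 0.
f = lambda x: x**2 + 3 * np.sin(x)**2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)   # f'(x)

x = np.linspace(-10.0, 10.0, 200_001)
x = x[np.abs(x) > 1e-3]                      # avoid the 0/0 ratio at the minimizer

# f is eta-PL for every eta <= inf_x |f'(x)|^2 / (2 (f(x) - inf f));
# we estimate this infimum on the grid.
ratio = grad(x)**2 / (2 * f(x))              # inf f = 0 here
print("estimated best PL constant:", ratio.min())
```

The printed value is strictly positive, which is exactly the PL property on this grid.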

Consequently, optimization algorithms will not find minimizers of $f$; instead, they will search for the minimum value of $f$. Just as for strongly convex functions, there is a general convergence result for vanilla Gradient Descent, but I'll only present the corresponding result for SGD. That is, we now consider the minimization of a function $f$ of the form

$$ f(x) = \frac{1}{m}\sum_{i=1}^m f_i(x) $$

where the $f_i : \mathbb{R}^d \to \mathbb{R}$ are smooth functions and $f$ itself is PL. SGD is given by the updates

$$ x_{n+1} = x_n - \varepsilon \nabla f_{i_n}(x_n) $$

where $\varepsilon > 0$ is a fixed step size and, for each step $n$, the index $i_n$ is drawn independently of $x_n$ with $i_n \sim \mathrm{Unif}\{1,\dotsc, m\}$. For future reference, we recall that $f$ is $M$-smooth if $f(y)-f(x)\leqslant \langle \nabla f(x), y-x\rangle + \frac{M}{2} |x-y|^2$; see the note on GD.
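In code, the update rule above can be sketched as follows (a minimal Python sketch; it assumes you can hand it the list `grads` of gradient functions $\nabla f_1, \dotsc, \nabla f_m$):

```python
import numpy as np

def sgd(x0, grads, eps, n_steps, seed=0):
    """Constant step-size SGD on f = (1/m) * sum_i f_i, given the gradients of the f_i."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        i = rng.integers(len(grads))   # i_n ~ Unif{1, ..., m}, drawn independently of x_n
        x = x - eps * grads[i](x)      # x_{n+1} = x_n - eps * grad f_{i_n}(x_n)
    return x
```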

If $f$ is $\eta$-PL, all the $f_i$ are $M$-smooth, and the step size $\varepsilon$ is smaller than $\eta / M^2$, then

$$ \mathbb{E}[f(x_{n+1})] - \inf f \leqslant (1 - \varepsilon \eta)^n (f(x_0) - \inf f) + \frac{M^2\varepsilon}{\eta}\delta $$

where $\delta$ is the local discrepancy of $f$ around its minimal value:

$$ \delta = \inf_x \mathbb{E}[f_i(x)] - \mathbb{E}\Big[\inf_x f_i(x)\Big] \geqslant 0, \qquad i \sim \mathrm{Unif}\{1, \dotsc, m\}. $$

The difference between this and the classical GD result is the additional term featuring $\delta$. Strikingly, when all the $f_i$ have the same minimal value (i.e. $\inf f_i = \inf f$), this term disappears since $\delta = 0$. This situation is called interpolation. It does imply that all the $f_i$ reach their minimal value at the same point. One might be tempted to check how this $\delta$ relates to the local variance $\sigma$ introduced in the case of strongly convex functions. It turns out[3] that if all the $f_i$ are $M$-smooth, then (i) in general, $\sigma \leqslant 2M\delta$; (ii) if the $f_i$ are $\mu$-strongly convex, then $2\mu \delta \leqslant \sigma$. So in the case of strongly convex functions, these two parameters measure the same thing.
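For a concrete example of $\delta$, take $d = 1$ and $f_i(x) = (x - a_i)^2/2$ for some numbers $a_1, \dotsc, a_m$. Each $f_i$ has $\inf_x f_i = 0$, while $\mathbb{E}[f_i(x)] = \frac{1}{m}\sum_i (x-a_i)^2/2$ is minimized at the mean $\bar a$, so that
$$ \delta = \frac{1}{2m}\sum_{i=1}^m (a_i - \bar a)^2, $$
half the empirical variance of the $a_i$. Interpolation ($\delta = 0$) happens exactly when all the $a_i$ are equal, that is, when all the $f_i$ share the same minimizer.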

Proof

It is once again the same kind of tricks as for GD and SGD, but now we apply them directly to $f(x_n)$ instead of to $x_n$. For simplicity, we will suppose that the minimum value of $f$ is zero, i.e. $\inf f = 0$.

Writing the fact that $f$ is $M$-smooth (it is, since all the $f_i$ are $M$-smooth), then using the definition of the SGD update rule, we have

$$ \begin{aligned} f(x_{n+1}) - f(x_n) &\leqslant \langle \nabla f(x_n), x_{n+1} - x_n\rangle + \frac{M}{2}|x_{n+1} - x_n|^2 \\ &= - \varepsilon \langle \nabla f (x_n), \nabla f_{i_n}(x_n)\rangle + \frac{M\varepsilon^2}{2}|\nabla f_{i_n}(x_n)|^2. \end{aligned} $$

We now take conditional expectations with respect to $x_n$, denoted $\mathbb{E}_n$. Since $\mathbb{E}_n[\nabla f_{i_n}(x_n)] = \nabla f (x_n)$, we get

$$ \mathbb{E}_n[f(x_{n+1})] \leqslant f(x_n) - \varepsilon | \nabla f(x_n)|^2 + \frac{M\varepsilon^2}{2}\mathbb{E}_n[|\nabla f_{i_n}(x_n)|^2]. \tag{2} $$

Now we need to estimate the latter expectation and for this, we use the following elementary lemma.

Lemma. If $g$ is $M$-smooth, then $|\nabla g(x)|^2 \leqslant 2M (g(x) - \inf g)$.

The proof is left as an exercise. Incidentally, this fact sheds some light on the PL-condition. In fact, PL functions satisfy the opposite inequality up to a constant. This is why PL functions efficiently replace the strong convexity assumption.
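To make the remark precise: if $g$ is both $\eta$-PL and $M$-smooth, then the PL inequality (1) together with the Lemma gives
$$ 2\eta\,(g(x) - \inf g) \leqslant |\nabla g(x)|^2 \leqslant 2M\,(g(x) - \inf g) $$
for every $x$; in particular $\eta \leqslant M$ as soon as $g$ is not constant, so the step size condition $\varepsilon < \eta/M^2 \leqslant 1/M$ makes sense.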

Apply the Lemma to the $M$-smooth function $f_{i_n}$ to get

$$ \begin{aligned} \mathbb{E}_n[|\nabla f_{i_n}(x_n)|^2] &\leqslant 2M \, \mathbb{E}_n [f_{i_n}(x_n) - \inf f_{i_n}] \\ &= 2M\,\mathbb{E}_n [f_{i_n}(x_n)] - 2M\,\mathbb{E}[\inf f_{i}] \\ &= 2Mf(x_n) + 2M\delta, \end{aligned} $$

where we used $\mathbb{E}_n[f_{i_n}(x_n)] = f(x_n)$ and, since $\inf f = 0$, $\delta = \inf_x f(x) - \mathbb{E}[\inf f_i] = -\mathbb{E}[\inf f_i]$.

Now, plug this bound into (2) to get

$$ \mathbb{E}_n[f(x_{n+1})] \leqslant f(x_n) - \varepsilon | \nabla f(x_n)|^2 + M^2\varepsilon^2 f(x_n) + M^2\varepsilon^2 \delta. $$

We did not use the PL condition so far – but not for much longer. When reversed, (1) tells us that $-|\nabla f(x_n)|^2 \leqslant -2\eta f(x_n)$ (remember that $\inf f = 0$). We plug this into the last inequality, take expectations, reorganize terms, and we get

$$ \mathbb{E}[f(x_{n+1})] \leqslant (1- 2\eta \varepsilon + M^2\varepsilon^2)\, \mathbb{E}[f(x_n)] + M^2\varepsilon^2 \delta. $$

We are almost done. The first factor is $1 + \varepsilon (M^2\varepsilon - 2\eta)$, and we chose the step size $\varepsilon$ precisely so that $M^2\varepsilon - 2\eta < -\eta$. The final recursion reads

$$ \mathbb{E}[f(x_{n+1})] \leqslant (1 - \varepsilon \eta)\,\mathbb{E}[f(x_n)] + (M\varepsilon)^2\delta. $$

This is once again a classical recursion of the form $u_{n+1} \leqslant a u_n + b$, which unrolls into $u_{n+1}\leqslant a^{n+1} u_0 + b(1 - a^{n+1})/(1-a) \leqslant a^n u_0 + b/(1-a)$. Here $a = 1 - \varepsilon\eta$ and $b = (M\varepsilon)^2\delta$, so that $b/(1-a) = M^2\varepsilon\delta/\eta$, hence

$$ \mathbb{E}[f(x_{n+1})] \leqslant (1 - \varepsilon \eta)^n f(x_0) + \frac{M^2\varepsilon}{\eta}\delta. $$
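As an illustration (a toy sketch, with all constants chosen for convenience), the following Python script runs SGD on the quadratic example $f_i(x) = (x - a_i)^2/2$ discussed above, for which $\eta = M = 1$ and $\delta = \frac{1}{2m}\sum_i (a_i - \bar a)^2$, and compares the averaged suboptimality gap with the bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite sum: f_i(x) = (x - a_i)^2 / 2 in dimension one.
# Each f_i is 1-smooth and f itself is 1-PL, so eta = M = 1,
# and delta = inf f = Var(a) / 2 for this example.
m = 10
a = rng.normal(size=m)
eta, M = 1.0, 1.0
eps = 0.3                 # step size, must be smaller than eta / M**2 = 1
n_steps, n_runs = 60, 2000

f = lambda x: np.mean((x - a) ** 2) / 2
inf_f = f(a.mean())
delta = inf_f

x0 = 5.0
gaps = np.zeros(n_steps + 1)
for _ in range(n_runs):                 # average over runs to approximate E[f(x_n)]
    x = x0
    run = [f(x) - inf_f]
    for _ in range(n_steps):
        i = rng.integers(m)             # i_n ~ Unif{1, ..., m}
        x = x - eps * (x - a[i])        # x_{n+1} = x_n - eps * grad f_{i_n}(x_n)
        run.append(f(x) - inf_f)
    gaps += np.array(run)
gaps /= n_runs

for n in (0, 10, 30, 60):
    # bound from the unrolled recursion: (1 - eps*eta)^n (f(x0) - inf f) + M^2 eps delta / eta
    bound = (1 - eps * eta) ** n * (f(x0) - inf_f) + M**2 * eps * delta / eta
    print(f"n = {n:2d}   E[f(x_n)] - inf f ~ {gaps[n]:.4f}   bound ~ {bound:.4f}")
```

The averaged gap decays geometrically and then stagnates around a level proportional to $\varepsilon\delta$, below the bound, exactly as the theorem predicts.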

Discussion

References

These three notes on GD and SGD were mostly taken from the wonderful survey by Guillaume Garrigos and Robert Gower, Handbook of Convergence Theorems for (Stochastic) Gradient Methods. Guillaume proofread this note. Most of the remaining errors are due to him.


[1] The Polish Ł is actually pronounced like the French « ou » or the English « w ». So Łojasiewicz sounds like « Wo-ja-see-yeah-vitch ». Among well-known Polish names with this letter, there's also Rafał Latała, Jan Łukasiewicz or my former colleague Bartłomiej Błaszczyszyn whose name is legendarily hard to pronounce for Westerners.

[2] Figure extracted from the Liu, Zhu, Belkin paper.

[3] See Lemma 4.18 in the survey.