πŸ‹πŸΌ Heavy tails I: extremes events and randomness

November 2023

The famous Pareto principle states that, when many independent sources contribute to a quantitative phenomenon, roughly 80% of the total originates from 20% of the sources:Β 80% of the wealth is owned by (less than) 20% of the people, you wear the same 20% of your wardrobe 80% of the time, 20% of your efforts are responsible for 80% of your grades, 80% of your website traffic comes from 20% of your content, etc.

This phenomenon mostly comes from a severe imbalance of the underlying probability distribution: for each sample, there is a not-so-small probability of this sample being unusually large. This is what we call heavy tails. In this post, we'll give a mathematical definition, a few examples, and show how they lead to Pareto-like principles.

The tail distribution of a random variable

If $X$ is a random number, the tail of its distribution is the probability of $X$ taking large values: $G(x)=\mathbb{P}(X>x)$. Of course, this function is decreasing in $x$; the question is, how fast?

Light tails

Some distributions from classical probability have tails which decrease quickly toward zero. This is the case for Gaussian random variables: a classical equivalent shows that if $X \sim \mathscr{N}(0,1)$, then when $x$ is very large[1]

$$\mathbb{P}(X>x) \sim \frac{e^{-x^2/2}}{x\sqrt{2\pi}}.$$

This probability is overwhelmingly small. For example, $\mathbb{P}(X>5)$ is already smaller than $0.0001\%$. It means that if you draw $10000$ samples from $X$, the probability of having one of those samples greater than $5$ is approximately $0.3\%$. That's possible, but very rare.
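
These numbers are easy to check yourself; here is a quick sketch using only the standard library (the survival function of the standard normal can be written with `math.erfc`):

```python
import math

# Survival function of the standard normal: P(X > x) = erfc(x / sqrt(2)) / 2
def normal_tail(x):
    return math.erfc(x / math.sqrt(2)) / 2

p = normal_tail(5.0)             # about 2.9e-7, i.e. below 0.0001%
p_any = 1 - (1 - p) ** 10_000    # chance that one of 10000 samples exceeds 5
print(p, p_any)                  # p_any is roughly 0.003
```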

Heavy tails

A distribution is heavy-tailed when $\mathbb{P}(X>x)$ does not decay as fast as $e^{-x^2}$ or even $e^{-x}$, but rather like inverses of polynomials: $1/x$ or $1/x^2$ for example. That's the case for the ratio of two independent standard Gaussian variables $X/Y$, whose density is $1/(\pi(1+x^2))$. For this distribution, called the Cauchy distribution[2], a direct computation gives

$$\mathbb{P}(X>x) = \frac{1}{2} - \frac{\arctan(x)}{\pi} \sim \frac{1}{\pi x}.$$

This decays very slowly. For example $\mathbb{P}(X>5)\approx 1/(5\pi) \approx 6\%$, which means that if you draw as few as $100$ samples from $X$, then you will see one of the samples larger than $5$ with probability $99.8\%$. That's a very different behaviour from the preceding Gaussian example.
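
Again, this is easy to verify by simulation. The sketch below generates Cauchy samples as ratios of two independent Gaussians and compares the empirical tail with $1/(5\pi)$ (the sample size and seed here are arbitrary choices):

```python
import math
import random

random.seed(0)

# Cauchy samples as ratios of two independent standard Gaussians
n = 200_000
count = sum(1 for _ in range(n)
            if random.gauss(0, 1) / random.gauss(0, 1) > 5)

frac = count / n
print(frac)                # empirical P(X > 5)
print(1 / (5 * math.pi))   # theoretical tail, about 0.063
```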

Mathematically, we say that a distribution is heavy-tailed if $\mathbb{P}(X>x)$ is asymptotically comparable to $1/x^s$ for some $s$. By "asymptotically comparable", we mean that factors like $\log(x)$ should not count. There is a class of functions, called slowly varying, which makes this precise: they are the functions which grow or decay more slowly than any power of $x$, such as constants and logarithms. Just forget about this technical point: for the rest of the note, think of "slowly varying" as "almost constant". I will keep this denomination.

Definition. A distribution is heavy-tailed if there is an essentially constant function $c$ and an index $s>0$ such that $$\mathbb{P}(X>x) = \frac{c(x)}{x^s}.$$

The same definition holds for $x\to -\infty$.

Densities

If $X$ has a density $f$, then one can generally see the heavy tail of $X$ in the asymptotics of $f$. Roughly speaking, if for example $f(x) \approx c / x^{s+1}$ when $x$ is large, then $\mathbb{P}(X > x) \approx c\int_x^\infty y^{-s-1}\,dy = c' x^{-s}$, so $X$ is heavy-tailed with index $s$. Most of our following examples will have densities.
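
As a sanity check of this heuristic, one can integrate a density of the form $s/x^{s+1}$ on $(1,\infty)$ numerically and compare with the closed-form tail $x^{-s}$. A minimal sketch, with $s=2$ and a crude trapezoid rule (the cutoff and step count are arbitrary choices):

```python
import math

s = 2.0

# A density of the form s / x^(s+1) on (1, infinity)
def density(y):
    return s / y ** (s + 1) if y > 1 else 0.0

# Crude trapezoid rule for P(X > x), integrating up to a large cutoff M
def tail_numeric(x, M=10_000.0, steps=200_000):
    h = (M - x) / steps
    total = 0.5 * (density(x) + density(M))
    for i in range(1, steps):
        total += density(x + i * h)
    return h * total

approx = tail_numeric(5.0)
exact = 5.0 ** -s          # the closed-form tail x^{-s} = 0.04
print(approx, exact)
```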

Log-scales

On the left, you see the two densities mentioned above: Gaussian and Cauchy. On classical plots like this, it is almost impossible to see if something is heavy-tailed or not. The orange curve seems to go to zero slower than the blue one, but at which rate? This is why, when it comes to heavy tails, log-scales on plots are ubiquitous. The plot in the middle is the same as the one on the left, but with a log-scale on the y-axis; and on the right, both axes have a log-scale.

In fact, if $X$ has a density $f$, then discerning by bare visual inspection whether $f(x)$ decays to zero polynomially or exponentially is almost impossible. But on a log-log scale, a polynomial decay $c/x^{s+1}$ becomes a straight line with slope $-(s+1)$, while an exponential decay bends downward faster than any line.

This is why most plots you will see on this topic are on log-scales or log-log scales.
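
This also gives a crude way to read off the tail index from data: plot the empirical survival function on a log-log scale and fit a line; its slope should be close to $-s$. A rough sketch, where samples are drawn by the inverse transform $X = U^{-1/s}$ (sample size, grid of points and seed are arbitrary choices):

```python
import math
import random

random.seed(1)
s = 1.5

# Heavy-tailed samples: X = U^(-1/s) has the Pareto tail P(X > x) = x^(-s)
n = 100_000
samples = [random.random() ** (-1 / s) for _ in range(n)]

# Empirical survival function at a few points, in log-log coordinates
points = [2.0, 4.0, 8.0, 16.0, 32.0]
logx = [math.log(x) for x in points]
logG = [math.log(sum(1 for v in samples if v > x) / n) for x in points]

# Least-squares slope of log G(x) against log x: should be close to -s
mx, mG = sum(logx) / len(logx), sum(logG) / len(logG)
slope = (sum((a - mx) * (b - mG) for a, b in zip(logx, logG))
         / sum((a - mx) ** 2 for a in logx))
print(slope)   # close to -1.5 for this Pareto tail
```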

Examples of heavy-tailed densities

The Power-Law, or Pareto distribution

The most basic and important example of a heavy-tailed distribution is the Pareto one. It is directly parametrized by its index $s>0$ and its density is given by

$$\rho_s(x) = \frac{s\,\mathbf{1}_{x>1}}{x^{s+1}}.$$

I started the distribution at $1$, but some people add its starting point as a second parameter. It's really not important. The Pareto law is often denoted $\mathrm{PL}(s)$.
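
Sampling from $\mathrm{PL}(s)$ is straightforward by inverse transform: the CDF is $F(x) = 1 - x^{-s}$, so if $U$ is uniform on $(0,1)$ then $U^{-1/s}$ is $\mathrm{PL}(s)$. A quick sketch checking the mean $s/(s-1)$ (the index, sample size and seed are arbitrary choices):

```python
import random

random.seed(2)
s = 3.0   # any index s > 1 has the finite mean s/(s-1)

# Inverse-transform sampling: U uniform on (0,1) gives U**(-1/s) ~ PL(s)
n = 1_000_000
mean = sum(random.random() ** (-1 / s) for _ in range(n)) / n

print(mean)          # empirical mean
print(s / (s - 1))   # theoretical mean: 1.5
```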

Other heavy-tailed densities

The FrΓ©chet distribution has CDF and PDF

$$F_s(x) = e^{-x^{-s}}, \qquad \rho_s(x) = \frac{s\,\mathbf{1}_{x>0}}{x^{1+s}}\,e^{-x^{-s}}.$$

Closely related is the Inverse Gamma distribution, with

$$F_{\lambda, s}(x) = \frac{\Gamma(s, \lambda/x)}{\Gamma(s)}, \qquad \rho_{\lambda,s}(x) = \frac{\lambda^s}{\Gamma(s)}\,\frac{\mathbf{1}_{x>0}}{x^{s+1}}\,e^{-\lambda/x}.$$

When $s=1/2$, this is often called a LΓ©vy distribution. It is a special case of the $\alpha$-stable distributions, a family which also encompasses the Cauchy distribution. There is also the Burr distribution (of type XII), with density

$$f(x) = \frac{c\,k\,x^{c-1}}{(1+x^c)^{k+1}}.$$

A shortlist of mechanisms leading to heavy tails

There are many survey papers on why heavy tails appear in the real world, like Newman's. In general, the most ubiquitous mechanisms are the following:

The Lorenz curve of heavy-tailed distributions

Now, let us see how heavy-tailed distributions account for imbalances like the 80-20 principle. In general, we measure such imbalances using the Lorenz curve: this curve gives the amount of mass "produced" below the $t$-th quantile of a probability distribution. By "mass", we mean the mathematical expectation. The correct definition of the Lorenz curve is the curve joining all points $(F(x), M(x))$ for all $x$, where $F(x) = \mathbb{P}(X < x)$ is the proportion of samples below level $x$, and

$$M(x) = \frac{\mathbb{E}[X\mathbf{1}_{X < x}]}{\mathbb{E}[X]}$$

is the proportion of the total mean $\mathbb{E}[X]$ coming from samples below $x$. This is the same curve as $(t, M(q(t)))$, where $q = F^{-1}$ is the quantile function.

For Pareto distributions with $s>1$, we have $F(x) = 1 - x^{-s}$, hence $q(t) = (1-t)^{-1/s}$. On the other hand $\mathbb{E}[X] = s/(s-1)$ and

$$\mathbb{E}[X\mathbf{1}_{X < x}] = s\int_1^x \frac{y}{y^{s+1}}\, dy = \frac{s}{s-1}\left(1 - x^{1-s}\right)$$

so that $M(x) = 1 - x^{1-s}$. A mere computation gives the following picture.

Mass imbalance in heavy tails.

  • The Lorenz curve of a PL(s)\mathrm{PL}(s) is given by t↦1βˆ’(1βˆ’t)1βˆ’1st \mapsto 1 - (1 - t)^{1 - \frac{1}{s}}.

  • The quantile contributing the last 80% of the total mass is given by q0.8(s)=1βˆ’0.8s/(sβˆ’1)\mathfrak{q}_{0.8}(s) = 1 - 0.8^{s/(s-1)}. I call this quantile the "Pareto Index" in the next picture below.

  • Pareto's "80-20 principle" corresponds to s=1.16s=1.16.

As shown with the dotted lines, the tail index which fits the 80-20 principle is $s \approx 1.16$. For this index (as for any index $1 < s < 2$), the Pareto distribution has a finite mean but no finite variance. For general heavy-tailed distributions, we would get the same kind of picture, with strongly convex Lorenz curves. Of course, since the distribution becomes more and more heavy-tailed as $s$ gets closer to $1$, these curves become less convex as $s$ increases.
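
For the record, this index can be recovered by solving $L(0.8) = 0.2$ in the Lorenz curve formula above; a two-line computation:

```python
import math

# Solve L(0.8) = 0.2 where L(t) = 1 - (1 - t)**(1 - 1/s) is the Lorenz
# curve of PL(s): this gives 1 - 1/s = log(0.8) / log(0.2)
s = 1 / (1 - math.log(0.8) / math.log(0.2))
print(s)   # about 1.16

# Sanity check: the bottom 80% of samples carry 20% of the mass
print(1 - 0.2 ** (1 - 1 / s))
```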

In general, estimating the heavy-tail index $s$ is a difficult task. I have a note on this topic where I describe the Hill estimator.
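
To give the flavour, here is a minimal sketch of the Hill estimator on synthetic Pareto data (an illustration only, not the full story from that note): the average log-spacing of the $k$ largest order statistics estimates $1/s$, so its inverse estimates the tail index. The choice of $k$, the sample size and the seed are arbitrary here.

```python
import math
import random

random.seed(3)
s = 2.0

# Synthetic PL(s) samples via inverse transform, sorted in decreasing order
n = 100_000
xs = sorted((random.random() ** (-1 / s) for _ in range(n)), reverse=True)

# Hill estimator on the k largest order statistics: the mean log-spacing
# estimates 1/s, so its inverse estimates the tail index
k = 1_000
hill = sum(math.log(xs[i] / xs[k]) for i in range(k)) / k
print(1 / hill)   # should be close to s = 2
```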

References


[1] This estimate is very precise: the error is of order $e^{-x^2/2}/x^3$, so as soon as $x$ is greater than $5$ it is smaller than $10^{-7}$.

[2] Indeed, $\arctan(x) + \arctan(1/x) = \pi/2$, so $\mathbb{P}(X>x) = \arctan(1/x)/\pi$; and when $t=1/x$ is close to zero, $\arctan(t)\sim t$.