🏋🏼 Heavy-tails II: is it really heavy?

December 2023

In the preceding note on heavy tails, we defined heavy-tailed distributions as distributions for which the excess function $\mathbb{P}(X>x)$ decays polynomially fast towards zero, and not exponentially fast (as for Gaussians for example). Formally, we asked

\mathbb{P}(X>x) = \frac{c(x)}{x^s}

where $c(x)$ is some unspecified function with slow variation (ie, at infinity it is almost a constant, like $\log \log \log (x)$ ) and $s>0$ is called the heavy-tail index.

This raises two questions, which are statistical in nature. Suppose you are given samples $X_1, \dotsc, X_n$ from a same distribution.

How do check if their common distribution is heavy-tailed?
How do you estimate its tail index $s$ ?

In this post, we'll focus on the second question, which in practice often solves the first one. For illustration purposes, I generated three datasets whose histograms (in log-log scales) are the following ones:

The first one does not really look heavy-tailed. The two others sure look like heavy tails. But what could possibly be their index?

Naive estimators: histograms and rank plots

Histogram

In a histogram, we count the number of observations on small bins, say $[k\varepsilon, (k+1)\varepsilon]$ . The proportion of observations in any bin $[k\varepsilon,(k+1)\varepsilon]$ should be $G((k+1)\varepsilon) - G(k\varepsilon) \approx G'(k\varepsilon) \varepsilon\approx \varepsilon/(k\varepsilon)^{s+1}$ . The tails of the histogram should thus decay like $k^{-s-1}$ . We can also use this idea to estimate $s$ : suppose that $p(k)$ is the height of the bin $[k\varepsilon, (k+1)\varepsilon]$ . Then we should have $p(k)\approx c/k^{s+1}$ and so $\log p(k) \approx c - (s+1) \log(k)$ . With a linear regression of $\log p(k)$ on $\log(k)$ we get an estimate for $s+1$ .

Rank plot

There is a better way: instead of drawing histograms, one might draw "rank plots": for every $k\varepsilon$ , plot the number of observations larger than $k\varepsilon$ . These plots are less sensitive to noise in the observations.

Again, we can use this idea to estimate the tail-index $s$ . Indeed, if $Q(k)$ is the proportion of observations greater than $k\varepsilon$ , then we should have $Q(k) \approx G(k\varepsilon) \approx c/(k\varepsilon)^s$ , or $\log Q(k) \approx \mathrm{cst} - s \log(k)$ . A linear regression would give another estimate, directly for $s$ .

That's better, but still not very convincing.

Estimating the tail index in a Pareto model

Computing the Maximum-likelihood estimator

If we suspect that the $X_i$ were generated from a specific parametric family of densities (Pareto, Weibull, etc.), we can try to estimate the tail-index using methods from classical parametric statistics, like Maximum Likelihood. Suppose we know that the $X_i$ come from a Pareto distribution, whose density is $f_s(x) = sx^{-s-1}$ over $[1,\infty[$ . The likelihood of the observations is then

\ell(s) = \sum_{i=1}^n \ln(s) - (s+1)\ln(x_i)

which is maximized when $n/s = \sum \ln (x_i)$ . Consequently, the MLE estimator for $s$ is:

\hat{s} = \frac{1}{\frac{1}{n}\sum_{i=1}^n \ln(x_i)}.

If the $X_i$ were truly generated by a Pareto model, this estimator has good properties: it is consistent and asymptotically Gaussian, indeed $\sqrt{n}(\hat{s} - s)$ is approximately $\mathscr{N}(0,s^2)$ when $n$ is large. This opens the door to statistical significance tests.

For the three datasets showed earlier, the MLE estimators are:

| 1.7      | 0.43       | 1.1      |

Up to now, with three different estimators, we had very different estimates for $s$ …

Censoring

But what if my data are not Pareto-distributed? A possible approach goes as follows. Suppose that the $X_i$ follow some heavy-tailed distribution with density $f$ and that $f(x)\sim cx^{-s-1}$ for some constant $c$ . Then it means that, for large $x$ , the density $f$ is almost the same as a Pareto. Hence, we could simply forget about the $X_i$ which are too small and only keep the large ones above some threshold $M$ (after all, they are the ones responsible for the tail behaviour), then apply the MLE estimator. This is called censorship; typically, if $N$ is the number of observations above the chosen threshold, then

\hat{s}_{\mathrm{censored}} = \frac{1}{\frac{1}{N}\sum_{i=1}^n \mathbf{1}_{X_i > M}\ln(X_i/M)}.

This estimator is bad. Indeed,

it is biased and/or not consistent for many models^[1].
It is extremely sensitive to the choice of the threshold $M$ .

To avoid these pitfalls, the idea is to play with various levels of censorship $M$ ; that is, to choose a threshold $M_n$ which grows larger with $n$ . This leads to the Hill estimator.

Hill's estimator

What would be the best choice for $M$ in (4)? First, we note that $\hat{s}_M$ undergoes discontinuous changes when $M$ becomes equal to one of the $X_i$ , since in this case one term is added to the sum and the number of indices such that $X_k > M$ is incremented by 1. It is thus natural to replace $M$ with the largest samples, say $X_{(1)}, X_{(2)}…$ where $X_{(1)}>…>X_{(n)}$ are the ordered samples:

\hat{s}_k = \frac{1}{\frac{1}{k}\sum_{i=k}^n \ln(X_{(i)}/X_{(k)})}.

Here, $k$ is the number of $X_i$ larger than $X_{(k)}$ , so (5) fits the definition of the censored estimator (4) with $M = X_{(k)}$ . The crucial difference with (4) is that the threshold is now random.

We still have to explain how to choose $k$ . It turns out that many choices are possible, and they depend on $n$ , as stated by the following essential theorem.

Hill's estimator is consistent. Set

k=k_n

, and suppose that

k_n / n \to 0

and

k_n \to \infty

. If the

X_i

are iid samples from a heavy-tailed distribution with index

s

in the general sense of (??), then

\hat{s}_{k_n}

converges in probability towards

s

Proof in the case of Pareto random variables. If the $X_i$ have the Pareto density $\mathbf{1}_{[1,\infty[}sx^{-s-1}$ , then it is easy to check that $E_i = \ln(X_i)$ are Exponential random variables with rate $s$ . The distribution of the ordered tuple $(\ln(X_{(1)}), \dotsc, \ln(X_{(n)}))$ is thus the distribution of $(E_{(1)}, \dotsc, E_{(n)})$ . This distribution has a very nice representation due to the exponential density. Indeed, for any test function $\varphi$ ,

\begin{aligned} \mathbb{E}[\varphi(E_{(1)}, \dotsc, E_{(n)})]&= \sum_{\sigma}\mathbb{E}[\varphi(E_{\sigma(1)}, \dotsc, E_{\sigma(n)})\mathbf{1}_{E_{\sigma(1)}>\dotsb > E_{\sigma(n)}}]\\ &= \sum_\sigma \int_{x_1>\dotsb>x_n} \varphi(x_1, \dotsc, x_n)e^{-sx_1 - \dotsb - sx_n}dx_1\dots dx_n \\ &= \sum_\sigma \int_{u_1>0, \dotsc, u_n>0} \varphi(u_n + \dotsb + u_1, \dotsc, u_{n-1} + u_n, u_n)e^{-su_n - (su_n + su_{n-1}) - \dotsc -(su_n + \dotsb + su_1)}du_1… du_n \\ &= n! \int_{u_1>0, \dotsc, u_n>0} \varphi(u_n + \dotsb + u_1, \dotsc, u_{n-1} + u_n, u_n)e^{-ns u_n - (n-1) su_{n-1} - \dotsc - su_1}du_1… du_n \\ &= \mathbb{E}[\varphi(F_n + \dotsb + F_1, \dotsc, F_{n-1} + F_n, F_n)] \end{aligned}

where the

F_i

are independent exponential random variables,

F_i \sim \mathscr{E}(si)

. Moreover, if

E \sim \mathscr{E}(s)

then

E/\lambda \sim \mathscr{E}(s\lambda)

: since the

E_i

are all iid we thus have

F_i \stackrel{\mathrm{law}}{=}E_{i}/i

. In the end, we proved that

(E_{(1)}, \dotsc, E_{(n)}) \stackrel{\mathrm{law}}{=}\left( \sum_{i=1}^{n}\frac{E_i}{i}, \dotsc, \frac{E_{n-1}}{n-1} + \frac{E_{n}}{n}, \frac{E_n}{n}, \right),

in other words $E_{(i)}$ has the same distribution as $\sum_{j=i}^n E_j/j$ . As a consequence,

\begin{aligned} \frac{1}{k}\sum_{i=1}^k \ln(X_{(i)}/X_{(k)}) &= \frac{1}{k}\sum_{i=1}^k E_{(i)} - E_{(k)} \stackrel{\mathrm{law}}{=}\frac{1}{k}\sum_{i=1}^k \sum_{j=i+1}^{n} \frac{E_j}{j} = \frac{1}{k}\sum_{j=1}^{k-1}E_j. \end{aligned}

As long as

k=k_n \to \infty

, this is the empirical mean of

\mathscr{E}(s)

random variables, and it converges towards their common mean

1/s

as requested. Note that with this representation, we can also infer a CLT.

Proof in the general case. It is rougly the same idea, but with more technicalities coming from the properties of slowly varying functions. To give the sketch, if $F$ is the cdf of the law of the $X_i$ , then we can write $X_i = F^{-1}(U_i)$ where the $U_i$ are uniform on $[0,1]$ . It turns out that if $F$ satisfies (??), then its inverse can be written as $F^{-1}(u) = (1-u)^{-1/s}\ell(1/(1-u))$ – this is a difficult result due to Karamata. Then, $\log F^{-1}(X_i)$ can be represented as

s^{-1}\log 1/U_i + \log \ell(1/U_i).

The first term is precisely an $\mathscr{E}(s)$ random variable, hence the analysis of the Pareto case holds. Then, we must show that the second term, when summed over $i=1, \dotsc, k$ , does not contribute. This is again a consequence of slow variation.

Typically, $k_n$ could be $\lfloor \ln n \rfloor$ or $\lfloor \sqrt{n}\rfloor$ or even $\lfloor (\log \log \log n)^2\rfloor$ . That leaves many possible choices. In practice, instead of picking one such $k$ , statisticians prefer plotting $\hat{s}_k$ for all the possible $k$ and observe the general shape of the plot of $k\mapsto \hat{s}_k$ , which is called Hill's plot. Of course, Hill's theorem above says that most of the information on $s$ is contained in a small region at the beginning of the plot, a region where $k$ is small with respect to $n$ (ie $k/n \to 0$ ), but not too small (ie $k\to\infty$ ). It often happens that the Hill plot is almost constant on this small region, which can be seen for example in the second plot.

There's a whole science for reading Hill plots.

It turns out that the three datasets I created for this post are as follows:

the first one is an exponential distribution with parameter $1$ . It has a light tail.
the second one is a Pareto distribution with index $s=0.5$ .
the third one is (the absolute of) a Cauchy distribution, therefore its index is $s=1$ .

References

Hill's estimator. Hill's PhD-grandfather was Wald.
The manual Fundamentals of heavy tails is very clear on this topic.

^[1] It is easy to see that if the $X_i$ come from a mixture of two Pareto distributions with different parameters $s_1, s_2$ , then the censored MLE estimator will not converge towards the true tail index wich is $\min\{s_1, s_2\}$ .