In the preceding note on heavy tails, we defined heavy-tailed distributions as distributions for which the excess function $\mathbb{P}(X>x)$ decays polynomially fast towards zero, not exponentially fast (as it does for Gaussians, for example). Formally, we required

$\mathbb{P}(X>x) = \frac{c(x)}{x^s}$ where $c(x)$ is some unspecified function with slow variation (i.e., at infinity it is almost a constant, like $\log \log \log (x)$) and $s>0$ is called the **heavy-tail index**.

This raises two questions, which are statistical in nature. Suppose you are given samples $X_1, \dotsc, X_n$ from the same distribution.

1. How do you check whether their common distribution is heavy-tailed?

2. How do you estimate its tail index $s$?

In this post, we'll focus on the second question, which in practice often answers the first one. For illustration purposes, I generated three datasets whose histograms (in log-log scale) are the following ones:

The first one does not really look heavy-tailed. The two others sure look like heavy tails. But what could possibly be their index?

In a histogram, we count the number of observations in small bins, say $[k\varepsilon, (k+1)\varepsilon]$. Write $G(x) = \mathbb{P}(X>x)$ for the survival function. The proportion of observations in any bin $[k\varepsilon,(k+1)\varepsilon]$ should be $G(k\varepsilon) - G((k+1)\varepsilon) \approx -G'(k\varepsilon) \varepsilon\approx \varepsilon/(k\varepsilon)^{s+1}$ up to constants. The tails of the histogram should thus decay like $k^{-s-1}$. We can also use this idea to estimate $s$: suppose that $p(k)$ is the height of the bin $[k\varepsilon, (k+1)\varepsilon]$. Then we should have $p(k)\approx c/k^{s+1}$ and so $\log p(k) \approx c - (s+1) \log(k)$. With a linear regression of $\log p(k)$ on $\log(k)$ we get an estimate for $s+1$.
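As a quick sketch of this histogram regression (the seed, sample size, and bin width below are my own choices, not from the post), on a synthetic Pareto sample whose true index is known:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 1.5
# Pareto(s) sample: if U is Uniform(0,1), then U**(-1/s) has tail index s
x = rng.uniform(size=100_000) ** (-1 / s)

# Histogram heights p(k) should decay like c / k**(s+1)
counts, edges = np.histogram(x, bins=np.arange(1.0, 50.0, 0.5), density=True)
centers = (edges[:-1] + edges[1:]) / 2
mask = counts > 0  # drop empty bins before taking logs

# Linear regression of log p(k) on log k: the slope estimates -(s+1)
slope, intercept = np.polyfit(np.log(centers[mask]), np.log(counts[mask]), 1)
s_hat = -slope - 1
```

With 100,000 samples, `s_hat` lands reasonably close to the true $s = 1.5$, but the noisy tail bins make the regression fragile, which is exactly the weakness discussed next.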

There is a better way: instead of drawing histograms, one might draw "rank plots": for every $k\varepsilon$, plot the number of observations larger than $k\varepsilon$. These plots are less sensitive to noise in the observations.

Again, we can use this idea to estimate the tail-index $s$. Indeed, if $Q(k)$ is the proportion of observations greater than $k\varepsilon$, then we should have $Q(k) \approx G(k\varepsilon) \approx c/(k\varepsilon)^s$, or $\log Q(k) \approx \mathrm{cst} - s \log(k)$. A linear regression would give another estimate, directly for $s$.

That's better, but still not very convincing.

If we suspect that the $X_i$ were generated from a specific parametric family of densities (Pareto, Weibull, etc.), we can try to estimate the tail index using methods from classical parametric statistics, like Maximum Likelihood. Suppose we know that the $X_i$ come from a Pareto distribution, whose density is $f_s(x) = sx^{-s-1}$ over $[1,\infty[$. The log-likelihood of the observations is then

$\ell(s) = \sum_{i=1}^n \ln(s) - (s+1)\ln(x_i),$ which is maximized when $n/s = \sum \ln (x_i)$. Consequently, the MLE estimator for $s$ is:

$\hat{s} = \frac{1}{\frac{1}{n}\sum_{i=1}^n \ln(x_i)}.$

If the $X_i$ were truly generated by a Pareto model, this estimator has good properties: it is consistent and asymptotically Gaussian, indeed $\sqrt{n}(\hat{s} - s)$ is approximately $\mathscr{N}(0,s^2)$ when $n$ is large. This opens the door to statistical significance tests.
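The MLE and its asymptotic confidence interval take two lines of code. A minimal sketch, on a synthetic Pareto sample of my own (seed and size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
s, n = 1.5, 10_000
x = rng.uniform(size=n) ** (-1 / s)   # Pareto(s) sample on [1, infinity)

# MLE: the inverse of the empirical mean of the log-observations
s_hat = 1 / np.log(x).mean()

# Asymptotic 95% interval, from sqrt(n)(s_hat - s) ≈ N(0, s^2)
half_width = 1.96 * s_hat / np.sqrt(n)
ci = (s_hat - half_width, s_hat + half_width)
```

The interval width shrinks like $s/\sqrt{n}$, so precise estimates of large tail indices require surprisingly many samples.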

For the three datasets shown earlier, the MLE estimators are:

| Dataset 1 | Dataset 2 | Dataset 3 |
|:---:|:---:|:---:|
| 1.7 | 0.43 | 1.1 |

Up to now, with three different estimators, we have obtained very different estimates for $s$…

But what if my data are *not* Pareto-distributed? A possible approach goes as follows. Suppose that the $X_i$ follow some heavy-tailed distribution with density $f$ and that $f(x)\sim cx^{-s-1}$ for some constant $c$. This means that, for large $x$, the density $f$ is almost that of a Pareto. Hence, we could simply forget about the $X_i$ which are too small and only keep the large ones above some threshold $M$ (after all, they are the ones responsible for the tail behaviour), then apply the MLE estimator. This is called *censoring*; typically, if $N$ is the number of observations above the chosen threshold, the censored estimator is

$\hat{s}_M = \frac{N}{\sum_{i : X_i > M} \ln(X_i / M)}. \tag{4}$

This estimator is bad. Indeed,

- it is biased and/or not consistent for many models^{[1]};
- it is extremely sensitive to the choice of the threshold $M$.
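Both problems can be seen on a small simulation. The mixture of two Paretos below is my own assumed example (in the spirit of the footnote), not the author's data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
# Half-half mixture of Pareto(0.5) and Pareto(3): the true tail index is 0.5
u, v = rng.uniform(size=(2, n))
x = np.where(u < 0.5, v ** (-1 / 0.5), v ** (-1 / 3))

def censored_mle(x, M):
    """Censored MLE: keep only the observations above M, fit a Pareto tail."""
    tail = x[x > M]
    return len(tail) / np.log(tail / M).sum()

# The estimate drifts with the threshold, and is badly biased at low M
estimates = {M: censored_mle(x, M) for M in (1.0, 10.0, 100.0)}
```

At $M=1$ (no censoring at all) the estimate sits far above the true index $0.5$, because the light-tailed component pollutes the fit; only for large $M$ does it approach the truth, at the price of throwing away most of the data.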

To avoid these pitfalls, the idea is to play with various levels of censoring $M$; that is, to choose a threshold $M_n$ which grows larger with $n$. This leads to the Hill estimator.

What would be the best choice for $M$ in (4)? First, we note that $\hat{s}_M$ undergoes discontinuous changes when $M$ becomes equal to one of the $X_i$, since in this case one term is added to the sum and the number of indices such that $X_k > M$ is incremented by 1. It is thus natural to replace $M$ with the largest samples, say $X_{(1)}, X_{(2)}, \dotsc$, where $X_{(1)}>\dotsb>X_{(n)}$ are the ordered samples:

$\hat{s}_k = \left(\frac{1}{k}\sum_{i=1}^k \ln\frac{X_{(i)}}{X_{(k)}}\right)^{-1}. \tag{5}$

Here, $k$ is the number of $X_i$ greater than or equal to $X_{(k)}$, so (5) fits the definition of the censored estimator (4) with $M = X_{(k)}$. The crucial difference with (4) is that the threshold is now *random*.
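A minimal implementation of the Hill estimator (5), tested on a synthetic Pareto sample of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
s = 1.5
x = rng.uniform(size=100_000) ** (-1 / s)   # Pareto(s) test sample

def hill(x, k):
    """Hill estimator (5), based on the k largest order statistics."""
    xs = np.sort(x)[::-1]                   # X_(1) > X_(2) > ... > X_(n)
    return k / np.log(xs[:k] / xs[k - 1]).sum()

s_hat = hill(x, k=1000)
```

Note that the $i = k$ term of the sum is zero, so only the $k-1$ largest gaps between log-order-statistics actually contribute.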

We still have to explain how to choose $k$. It turns out that many choices are possible, and they depend on $n$, as stated by the following essential theorem.

**Theorem (Hill).** Suppose that the distribution of the $X_i$ satisfies $\mathbb{P}(X>x) = c(x)/x^s$ with $c$ slowly varying. If $k_n \to \infty$ and $k_n / n \to 0$, then $\hat{s}_{k_n}$ converges in probability towards $s$.

**Proof in the case of Pareto random variables.** If the $X_i$ have the Pareto density $\mathbf{1}_{[1,\infty[}sx^{-s-1}$, then it is easy to check that $E_i = \ln(X_i)$ are Exponential random variables with rate $s$. The distribution of the ordered tuple $(\ln(X_{(1)}), \dotsc, \ln(X_{(n)}))$ is thus the distribution of $(E_{(1)}, \dotsc, E_{(n)})$. This distribution has a very nice representation due to the exponential density. Indeed, for any test function $\varphi$,

$\begin{aligned}
\mathbb{E}[\varphi(E_{(1)}, \dotsc, E_{(n)})]&= \sum_{\sigma}\mathbb{E}[\varphi(E_{\sigma(1)}, \dotsc, E_{\sigma(n)})\mathbf{1}_{E_{\sigma(1)}>\dotsb > E_{\sigma(n)}}]\\
&= \sum_\sigma \int_{x_1>\dotsb>x_n} \varphi(x_1, \dotsc, x_n)e^{-sx_1 - \dotsb - sx_n}dx_1\dots dx_n \\
&= \sum_\sigma \int_{u_1>0, \dotsc, u_n>0} \varphi(u_n + \dotsb + u_1, \dotsc, u_{n-1} + u_n, u_n)e^{-su_n - (su_n + su_{n-1}) - \dotsb -(su_n + \dotsb + su_1)}du_1\dotsm du_n \\
&= n! \int_{u_1>0, \dotsc, u_n>0} \varphi(u_n + \dotsb + u_1, \dotsc, u_{n-1} + u_n, u_n)e^{-ns u_n - (n-1) su_{n-1} - \dotsb - su_1}du_1\dotsm du_n \\
&= \mathbb{E}[\varphi(F_n + \dotsb + F_1, \dotsc, F_{n-1} + F_n, F_n)]
\end{aligned}$

where the $F_i$ are independent exponential random variables, $F_i \sim \mathscr{E}(si)$. Moreover, if $E \sim \mathscr{E}(s)$ then $E/\lambda \sim \mathscr{E}(s\lambda)$: since the $E_i$ are all iid we thus have $F_i \stackrel{\mathrm{law}}{=}E_{i}/i$. In the end, we proved that
$(E_{(1)}, \dotsc, E_{(n)}) \stackrel{\mathrm{law}}{=}\left( \sum_{i=1}^{n}\frac{E_i}{i}, \dotsc, \frac{E_{n-1}}{n-1} + \frac{E_{n}}{n}, \frac{E_n}{n} \right),$
in other words $E_{(i)}$ has the same distribution as $\sum_{j=i}^n E_j/j$. As a consequence,

$\begin{aligned}
\frac{1}{k}\sum_{i=1}^k \ln(X_{(i)}/X_{(k)}) &= \frac{1}{k}\sum_{i=1}^k \left(E_{(i)} - E_{(k)}\right) \stackrel{\mathrm{law}}{=}\frac{1}{k}\sum_{i=1}^k \sum_{j=i}^{k-1} \frac{E_j}{j} = \frac{1}{k}\sum_{j=1}^{k-1}E_j.
\end{aligned}$

As long as $k=k_n \to \infty$, this is essentially the empirical mean of iid $\mathscr{E}(s)$ random variables, so it converges towards their common mean $1/s$, as required. Note that with this representation, we can also infer a CLT.
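Before moving on, a quick Monte Carlo sanity check of the order-statistics representation used above (the parameters are my own): the mean of $E_{(i)}$ should be $\sum_{j=i}^n 1/(sj)$.

```python
import numpy as np

rng = np.random.default_rng(6)
s, n, reps = 2.0, 5, 100_000
e = rng.exponential(scale=1 / s, size=(reps, n))   # iid E(s) variables
order = np.sort(e, axis=1)[:, ::-1]                # E_(1) > ... > E_(n), per row

emp = order.mean(axis=0)
# Representation: E_(i) has the law of sum_{j=i}^n E_j / j, whose mean is sum_{j=i}^n 1/(s*j)
theo = np.array([sum(1 / (s * j) for j in range(i, n + 1)) for i in range(1, n + 1)])
```

The empirical and theoretical means agree to within Monte Carlo error.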
**Proof in the general case.** It is roughly the same idea, but with more technicalities coming from the properties of slowly varying functions. To give the sketch, if $F$ is the cdf of the law of the $X_i$, then we can write $X_i = F^{-1}(U_i)$ where the $U_i$ are uniform on $[0,1]$. It turns out that if $F$ satisfies the tail condition $\mathbb{P}(X>x) = c(x)/x^s$, then its inverse can be written as $F^{-1}(u) = (1-u)^{-1/s}\ell(1/(1-u))$ with $\ell$ slowly varying; this is a difficult result due to Karamata. Then, $\ln F^{-1}(U_i)$ can be represented as

$\ln F^{-1}(U_i) = \frac{1}{s}\ln\frac{1}{1-U_i} + \ln \ell\left(\frac{1}{1-U_i}\right).$

The first term is precisely an $\mathscr{E}(s)$ random variable, hence the analysis of the Pareto case holds. Then, we must show that the second term, when summed over $i=1, \dotsc, k$, does not contribute. This is again a consequence of slow variation.

Typically, $k_n$ could be $\lfloor \ln n \rfloor$ or $\lfloor \sqrt{n}\rfloor$ or even $\lfloor (\log \log \log n)^2\rfloor$. That leaves many possible choices. In practice, instead of picking one such $k$, statisticians prefer plotting $\hat{s}_k$ for all the possible $k$ and observe the general shape of the plot of $k\mapsto \hat{s}_k$, which is called **Hill's plot**. Of course, Hill's theorem above says that most of the information on $s$ is contained in a small region at the beginning of the plot, a region where $k$ is small with respect to $n$ (i.e. $k/n \to 0$), but not too small (i.e. $k\to\infty$). It often happens that the Hill plot is almost constant on this small region, which can be seen for example in the second plot.
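The whole Hill plot can be computed in a vectorized way with one pass over the sorted sample. A sketch, on a synthetic Pareto sample of my own (the plotting call in the comment assumes matplotlib):

```python
import numpy as np

rng = np.random.default_rng(5)
s = 0.5
x = rng.uniform(size=50_000) ** (-1 / s)   # Pareto with tail index 0.5

logs = np.log(np.sort(x)[::-1])            # ln X_(1) >= ... >= ln X_(n)
ks = np.arange(2, len(x) + 1)              # k = 1 is degenerate, start at k = 2
# Hill estimate for every k at once: 1 / (mean of the top-k logs - ln X_(k))
s_hats = 1 / (np.cumsum(logs)[1:] / ks - logs[1:])
# The Hill plot is then simply: plt.semilogx(ks, s_hats)
```

On such a plot, the estimates are wild for the first few $k$, then settle near the true index over the intermediate region described above, before drifting once $k/n$ is no longer small.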

There's a whole science for reading Hill plots.

It turns out that the three datasets I created for this post are as follows:

- the first one is an exponential distribution with parameter $1$: it has a light tail;

- the second one is a Pareto distribution with index $s=0.5$;

- the third one is (the absolute value of) a Cauchy distribution, so its index is $s=1$.
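For reference, such datasets can be generated in a few lines (my seed and sample size are assumptions, not the author's):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
data_exp = rng.exponential(scale=1.0, size=n)       # exponential(1): light tail
data_pareto = rng.uniform(size=n) ** (-1 / 0.5)     # Pareto, s = 0.5
data_cauchy = np.abs(rng.standard_cauchy(size=n))   # |Cauchy|, s = 1
```

Running any of the estimators above on these three samples reproduces the qualitative story of the post: one light tail, and two heavy tails with very different indices.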

Hill's estimator is due to B. M. Hill. Amusingly, Hill's PhD-grandfather was Wald.

The textbook *The Fundamentals of Heavy Tails* (Nair, Wierman, Zwart) is very clear on this topic.

^{[1]} It is easy to see that if the $X_i$ come from a mixture of two Pareto distributions with different parameters $s_1, s_2$, then the censored MLE estimator will not converge towards the true tail index, which is $\min\{s_1, s_2\}$.