πŸ‹πŸΌ Heavy-tails II: is it really heavy?

December 2023

In the preceding note on heavy tails, we defined heavy-tailed distributions as distributions for which the excess function $\mathbb{P}(X>x)$ decays polynomially fast towards zero, and not exponentially fast (as for Gaussians, for example). Formally, we asked

$$\mathbb{P}(X>x) = \frac{c(x)}{x^s}$$

where $c(x)$ is some unspecified function with slow variation (i.e., at infinity it is almost a constant, like $\log \log \log (x)$) and $s>0$ is called the heavy-tail index.

This raises two questions, which are statistical in nature. Suppose you are given samples $X_1, \dotsc, X_n$ from the same distribution.

1. How can we tell whether this distribution is heavy-tailed?
2. If it is, how can we estimate its tail index $s$?

In this post, we'll focus on the second question, which in practice often solves the first one. For illustration purposes, I generated three datasets whose histograms (in log-log scale) are the following ones:

The first one does not really look heavy-tailed. The two others sure look like heavy tails. But what could possibly be their index?

Naive estimators: histograms and rank plots

Histogram

In a histogram, we count the number of observations in small bins, say $[k\varepsilon, (k+1)\varepsilon]$. Writing $G(x) = \mathbb{P}(X>x)$ for the excess function, the proportion of observations in any bin $[k\varepsilon,(k+1)\varepsilon]$ should be $G(k\varepsilon) - G((k+1)\varepsilon) \approx -G'(k\varepsilon)\varepsilon \approx \varepsilon/(k\varepsilon)^{s+1}$ up to constants. The tails of the histogram should thus decay like $k^{-s-1}$. We can also use this idea to estimate $s$: suppose that $p(k)$ is the height of the bin $[k\varepsilon, (k+1)\varepsilon]$. Then we should have $p(k)\approx c/k^{s+1}$ and so $\log p(k) \approx \mathrm{cst} - (s+1) \log(k)$. With a linear regression of $\log p(k)$ on $\log(k)$ we get an estimate for $s+1$.
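To make this concrete, here is a minimal sketch of the histogram fit in Python. The toy sample, the number of bins and the variable names are illustrative (this is not one of the datasets of this post):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.pareto(1.5, size=10_000) + 1.0           # toy heavy-tailed sample, true index s = 1.5

counts, edges = np.histogram(x, bins=200)        # bins of constant width epsilon
centers = 0.5 * (edges[:-1] + edges[1:])
mask = counts > 0                                # skip empty bins, where the log is undefined

# Regress log p(k) on log k: the slope should be close to -(s + 1)
slope, _ = np.polyfit(np.log(centers[mask]), np.log(counts[mask]), 1)
print("histogram estimate of s:", -slope - 1)
```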

Rank plot

There is a better way: instead of drawing histograms, one might draw "rank plots": for every $k\varepsilon$, plot the number of observations larger than $k\varepsilon$. These plots are less sensitive to noise in the observations.

Again, we can use this idea to estimate the tail-index $s$. Indeed, if $Q(k)$ is the proportion of observations greater than $k\varepsilon$, then we should have $Q(k) \approx G(k\varepsilon) \approx c/(k\varepsilon)^s$, or $\log Q(k) \approx \mathrm{cst} - s \log(k)$. A linear regression would give another estimate, directly for $s$.
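A minimal sketch of this rank-plot fit, using the empirical survival function evaluated at the sample points instead of a grid of $k\varepsilon$ (again on an illustrative toy sample):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.pareto(1.5, size=10_000) + 1.0           # toy heavy-tailed sample, true index s = 1.5

x_sorted = np.sort(x)[::-1]                      # decreasing order
survival = np.arange(1, len(x) + 1) / len(x)     # proportion of observations >= x_sorted[r]

# log Q(x) ~ cst - s log(x), so the regression slope should be close to -s
slope, _ = np.polyfit(np.log(x_sorted), np.log(survival), 1)
print("rank-plot estimate of s:", -slope)
```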

That's better, but still not very convincing.

Estimating the tail index in a Pareto model

Computing the Maximum-likelihood estimator

If we suspect that the $X_i$ were generated from a specific parametric family of densities (Pareto, Weibull, etc.), we can try to estimate the tail-index using methods from classical parametric statistics, like Maximum Likelihood. Suppose we know that the $X_i$ come from a Pareto distribution, whose density is $f_s(x) = sx^{-s-1}$ over $[1,\infty[$. The log-likelihood of the observations is then

$$\ell(s) = \sum_{i=1}^n \big(\ln(s) - (s+1)\ln(x_i)\big)$$

Setting the derivative $\ell'(s) = n/s - \sum_{i=1}^n \ln(x_i)$ to zero shows that $\ell$ is maximized when $n/s = \sum \ln(x_i)$. Consequently, the MLE estimator for $s$ is:

$$\hat{s} = \frac{1}{\frac{1}{n}\sum_{i=1}^n \ln(x_i)}.$$

If the $X_i$ were truly generated by a Pareto model, this estimator has good properties: it is consistent and asymptotically Gaussian; indeed, $\sqrt{n}(\hat{s} - s)$ is approximately $\mathscr{N}(0,s^2)$ when $n$ is large. This opens the door to statistical significance tests.
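A minimal sketch of this estimator on a toy Pareto sample, together with the rough confidence interval suggested by the asymptotic normality (plugging $\hat{s}$ in place of $s$ in the variance):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.pareto(1.5, size=10_000) + 1.0           # toy Pareto sample on [1, infinity), s = 1.5

n = len(x)
s_hat = 1.0 / np.mean(np.log(x))                 # MLE: inverse of the mean log
half_width = 1.96 * s_hat / np.sqrt(n)           # from sqrt(n)(s_hat - s) ~ N(0, s^2)
print(f"MLE: {s_hat:.3f}, approximate 95% CI: [{s_hat - half_width:.3f}, {s_hat + half_width:.3f}]")
```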

For the three datasets shown earlier, the MLE estimates are:

| Dataset 1 | Dataset 2 | Dataset 3 |
| --- | --- | --- |
| 1.7 | 0.43 | 1.1 |

Up to now, with three different estimators, we had very different estimates for $s$…

Censoring

But what if my data are not Pareto-distributed? A possible approach goes as follows. Suppose that the $X_i$ follow some heavy-tailed distribution with density $f$ and that $f(x)\sim cx^{-s-1}$ for some constant $c$. This means that, for large $x$, the density $f$ is almost the same as a Pareto density. Hence, we could simply forget about the $X_i$ which are too small and only keep the large ones, above some threshold $M$ (after all, they are the ones responsible for the tail behaviour), then apply the MLE estimator. This is called censoring; typically, if $N$ is the number of observations above the chosen threshold, then

$$\hat{s}_{M} = \frac{1}{\frac{1}{N}\sum_{i=1}^n \mathbf{1}_{X_i > M}\ln(X_i/M)}.$$
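Here is a minimal sketch of this censored estimator for a user-chosen threshold $M$; the helper name `censored_mle` and the toy sample are illustrative:

```python
import numpy as np

def censored_mle(x, M):
    """Censored Pareto MLE: keep only the observations above the threshold M."""
    tail = x[x > M]
    return 1.0 / np.mean(np.log(tail / M))

rng = np.random.default_rng(0)
x = rng.pareto(1.2, size=100_000) + 1.0          # toy heavy-tailed sample, true index s = 1.2

# The estimate depends on the arbitrary choice of threshold
for M in (1.0, 5.0, 50.0):
    print(f"M = {M}: s_hat = {censored_mle(x, M):.3f}")
```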

This estimator is bad. Indeed, everything hinges on the arbitrary choice of the threshold $M$: if $M$ is too small, the observations above $M$ are not well approximated by a Pareto sample and the estimator remains biased no matter how large $n$ is[1]; if $M$ is too large, only a handful of observations survive the censoring and the variance of the estimator blows up.

To avoid these pitfalls, the idea is to play with various levels of censoring $M$; that is, to choose a threshold $M_n$ which grows with $n$. This leads to the Hill estimator.

Hill's estimator

What would be the best choice for $M$ in the censored estimator above? First, we note that $\hat{s}_M$ undergoes discontinuous changes when $M$ becomes equal to one of the $X_i$, since in this case one term is added to the sum and the number of indices such that $X_k > M$ is incremented by 1. It is thus natural to replace $M$ with one of the largest samples, say $X_{(1)}, X_{(2)}, \dotsc$, where $X_{(1)}>\dotsb>X_{(n)}$ are the ordered samples:

$$\hat{s}_k = \frac{1}{\frac{1}{k}\sum_{i=1}^k \ln(X_{(i)}/X_{(k)})}.$$

Here, $k$ is the number of $X_i$ larger than or equal to $X_{(k)}$, so this fits the definition of the censored estimator above with $M = X_{(k)}$ (the term $i=k$ contributes zero to the sum). The crucial difference is that the threshold is now random.
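A minimal sketch of Hill's estimator on a toy sample; the helper name `hill_estimator` and the choice $k=1000$ are illustrative, and the question of how to choose $k$ is discussed next:

```python
import numpy as np

def hill_estimator(x, k):
    """Hill's estimator built from the k largest order statistics X_(1) > ... > X_(k)."""
    x_sorted = np.sort(x)[::-1]                  # decreasing order
    return 1.0 / np.mean(np.log(x_sorted[:k] / x_sorted[k - 1]))

rng = np.random.default_rng(0)
x = rng.pareto(1.2, size=100_000) + 1.0          # toy heavy-tailed sample, true index s = 1.2
print(hill_estimator(x, k=1_000))                # should be close to 1.2
```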

We still have to explain how to choose $k$. It turns out that many choices are possible, as long as $k$ depends suitably on $n$, as stated by the following essential theorem.

Hill's estimator is consistent. Set $k=k_n$, and suppose that $k_n / n \to 0$ and $k_n \to \infty$. If the $X_i$ are iid samples from a heavy-tailed distribution with index $s$ in the general sense defined above, then $\hat{s}_{k_n}$ converges in probability towards $s$.

Proof in the case of Pareto random variables. If the $X_i$ have the Pareto density $\mathbf{1}_{[1,\infty[}\,sx^{-s-1}$, then it is easy to check that the $E_i = \ln(X_i)$ are Exponential random variables with rate $s$. The distribution of the ordered tuple $(\ln(X_{(1)}), \dotsc, \ln(X_{(n)}))$ is thus the distribution of $(E_{(1)}, \dotsc, E_{(n)})$. This distribution has a very nice representation due to the exponential density. Indeed, for any test function $\varphi$,

$$\begin{aligned}
\mathbb{E}[\varphi(E_{(1)}, \dotsc, E_{(n)})] &= \sum_{\sigma}\mathbb{E}[\varphi(E_{\sigma(1)}, \dotsc, E_{\sigma(n)})\mathbf{1}_{E_{\sigma(1)}>\dotsb > E_{\sigma(n)}}]\\
&= \sum_\sigma \int_{x_1>\dotsb>x_n} \varphi(x_1, \dotsc, x_n)\, s^n e^{-sx_1 - \dotsb - sx_n}\,dx_1\dotsm dx_n \\
&= \sum_\sigma \int_{u_1>0, \dotsc, u_n>0} \varphi(u_n + \dotsb + u_1, \dotsc, u_{n-1} + u_n, u_n)\, s^n e^{-su_n - (su_n + su_{n-1}) - \dotsb - (su_n + \dotsb + su_1)}\,du_1\dotsm du_n \\
&= n!\, s^n \int_{u_1>0, \dotsc, u_n>0} \varphi(u_n + \dotsb + u_1, \dotsc, u_{n-1} + u_n, u_n)\, e^{-nsu_n - (n-1)su_{n-1} - \dotsb - su_1}\,du_1\dotsm du_n \\
&= \mathbb{E}[\varphi(F_n + \dotsb + F_1, \dotsc, F_{n-1} + F_n, F_n)]
\end{aligned}$$
where the $F_i$ are independent exponential random variables, $F_i \sim \mathscr{E}(si)$. Moreover, if $E \sim \mathscr{E}(s)$ then $E/\lambda \sim \mathscr{E}(s\lambda)$: since the $E_i$ are all iid with law $\mathscr{E}(s)$, we thus have $F_i \stackrel{\mathrm{law}}{=} E_{i}/i$. In the end, we proved that

$$(E_{(1)}, \dotsc, E_{(n)}) \stackrel{\mathrm{law}}{=}\left( \sum_{i=1}^{n}\frac{E_i}{i}, \ \dotsc, \ \frac{E_{n-1}}{n-1} + \frac{E_{n}}{n}, \ \frac{E_n}{n} \right),$$

in other words, $E_{(i)}$ has the same distribution as $\sum_{j=i}^n E_j/j$ (jointly in $i$). As a consequence,

$$\frac{1}{k}\sum_{i=1}^k \ln(X_{(i)}/X_{(k)}) = \frac{1}{k}\sum_{i=1}^k \big(E_{(i)} - E_{(k)}\big) \stackrel{\mathrm{law}}{=}\frac{1}{k}\sum_{i=1}^k \sum_{j=i}^{k-1} \frac{E_j}{j} = \frac{1}{k}\sum_{j=1}^{k-1}E_j.$$
As long as $k=k_n \to \infty$, this is the empirical mean of iid $\mathscr{E}(s)$ random variables, and by the law of large numbers it converges towards their common mean $1/s$, as claimed. Note that with this representation, we can also infer a CLT.
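As a quick sanity check (not a proof), one can verify this representation numerically, for instance by comparing the mean of the largest of $n$ iid $\mathscr{E}(s)$ variables with the mean of $\sum_{j=1}^n E_j/j$; the values below are illustrative:

```python
import numpy as np

# The largest of n iid Exp(s) variables should have the same law as sum_{j=1}^{n} E_j / j.
rng = np.random.default_rng(0)
s, n, trials = 2.0, 50, 100_000

max_of_exp = rng.exponential(scale=1 / s, size=(trials, n)).max(axis=1)
weighted_sums = (rng.exponential(scale=1 / s, size=(trials, n)) / np.arange(1, n + 1)).sum(axis=1)

# Both means should be close to (1/s) * (1 + 1/2 + ... + 1/n)
print(max_of_exp.mean(), weighted_sums.mean())
```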

Proof in the general case. It is roughly the same idea, but with more technicalities coming from the properties of slowly varying functions. To give a sketch: if $F$ is the cdf of the law of the $X_i$, then we can write $X_i = F^{-1}(U_i)$ where the $U_i$ are uniform on $[0,1]$. It turns out that if $F$ satisfies the heavy-tail condition above, then its inverse can be written as $F^{-1}(u) = (1-u)^{-1/s}\ell(1/(1-u))$ for some slowly varying function $\ell$; this is a difficult result due to Karamata. Then, $\ln(X_i) = \ln F^{-1}(U_i)$ can be represented (using the fact that $1-U_i$ is also uniform on $[0,1]$) as

$$s^{-1}\log(1/U_i) + \log \ell(1/U_i).$$

The first term is precisely an $\mathscr{E}(s)$ random variable, hence the analysis of the Pareto case holds. Then, we must show that the second term, when summed over $i=1, \dotsc, k$, does not contribute. This is again a consequence of slow variation.

Typically, $k_n$ could be $\lfloor \ln n \rfloor$ or $\lfloor \sqrt{n}\rfloor$ or even $\lfloor (\log \log \log n)^2\rfloor$. That leaves many possible choices. In practice, instead of picking one such $k$, statisticians prefer plotting $\hat{s}_k$ for all possible $k$ and observing the general shape of the plot of $k\mapsto \hat{s}_k$, which is called Hill's plot. Of course, Hill's theorem above says that most of the information on $s$ is contained in a small region at the beginning of the plot, a region where $k$ is small with respect to $n$ (i.e. $k/n \to 0$), but not too small (i.e. $k\to\infty$). It often happens that the Hill plot is almost constant on this small region, which can be seen for example in the second plot.
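A minimal sketch of how such a Hill plot can be computed and drawn, on a toy Pareto sample (the vectorized line is just the definition of $\hat{s}_k$ evaluated for every $k$ at once):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.pareto(1.2, size=100_000) + 1.0          # toy heavy-tailed sample, true index s = 1.2

logs = np.log(np.sort(x)[::-1])                  # ln X_(1) >= ln X_(2) >= ...
ks = np.arange(2, len(x) + 1)                    # k = 1 would divide by zero
mean_log_excess = np.cumsum(logs)[ks - 1] / ks - logs[ks - 1]
hill = 1.0 / mean_log_excess                     # Hill estimator for every k at once

plt.semilogx(ks, hill)
plt.axhline(1.2, linestyle="--", color="grey")   # true index, for reference
plt.xlabel("k")
plt.ylabel(r"$\hat{s}_k$")
plt.title("Hill plot")
plt.show()
```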

There's a whole science to reading Hill plots.

It turns out that the three datasets I created for this post are as follows:

References


[1] It is easy to see that if the $X_i$ come from a mixture of two Pareto distributions with different parameters $s_1, s_2$, then the censored MLE estimator will not converge towards the true tail index, which is $\min\{s_1, s_2\}$.