In the preceding note on heavy tails, we defined heavy-tailed distributions as those for which the excess function P(X>x) decays polynomially fast towards zero, not exponentially fast (as for Gaussians, for example). Formally, we required
$$P(X>x) = \frac{c(x)}{x^s} \tag{1}$$
where $c(x)$ is some unspecified slowly varying function (i.e., at infinity it is almost a constant, like $\log\log\log(x)$) and $s>0$ is called the heavy-tail index.
This raises two questions, which are statistical in nature. Suppose you are given samples $X_1, \dots, X_n$ from the same distribution.
1. How do you check if their common distribution is heavy-tailed?
2. How do you estimate its tail index $s$?
In this post, we'll focus on the second question, which in practice often solves the first one. For illustration purposes, I generated three datasets whose histograms (on log-log scales) are the following:
The first one does not really look heavy-tailed. The other two certainly look like heavy tails. But what could their index possibly be?
In a histogram, we count the number of observations falling in small bins, say $[k\varepsilon, (k+1)\varepsilon]$. If $G(x) = P(X>x)$ denotes the excess function from (1), the proportion of observations in any bin $[k\varepsilon, (k+1)\varepsilon]$ should be $G(k\varepsilon) - G((k+1)\varepsilon) \approx -G'(k\varepsilon)\,\varepsilon \approx \varepsilon/(k\varepsilon)^{s+1}$ up to constant factors. The tails of the histogram should thus decay like $k^{-s-1}$. We can also use this idea to estimate $s$: suppose that $p(k)$ is the height of the bin $[k\varepsilon, (k+1)\varepsilon]$. Then we should have $p(k) \approx c/k^{s+1}$, and so
$$\log p(k) \approx \mathrm{cst} - (s+1)\log(k). \tag{2}$$
With a linear regression of $\log p(k)$ on $\log(k)$, we get an estimate for $s+1$.
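To make this concrete, here is a minimal sketch of the regression in Python; the helper name, the number of bins and the restriction to non-empty bins beyond the median are my own arbitrary choices, not part of the method itself:

```python
import numpy as np

def tail_index_from_histogram(samples, n_bins=200):
    """Estimate s by regressing log bin heights on log bin centers:
    the slope should be about -(s+1)."""
    counts, edges = np.histogram(samples, bins=n_bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    # keep non-empty tail bins only (log is undefined on empty bins)
    mask = (counts > 0) & (centers > np.median(samples))
    slope, _ = np.polyfit(np.log(centers[mask]), np.log(counts[mask]), 1)
    return -slope - 1    # slope ~ -(s+1), hence s ~ -slope - 1
```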
There is a better way: instead of drawing histograms, one might draw "rank plots": for every $k\varepsilon$, plot the number of observations larger than $k\varepsilon$. These plots are less sensitive to noise in the observations.
Again, we can use this idea to estimate the tail index $s$. Indeed, if $Q(k)$ is the proportion of observations greater than $k\varepsilon$, then we should have $Q(k) \approx G(k\varepsilon) \approx c/(k\varepsilon)^s$, or
$$\log Q(k) \approx \mathrm{cst} - s\log(k). \tag{3}$$
A linear regression would give another estimate, directly for $s$.
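The corresponding sketch for the rank plot; instead of a grid $k\varepsilon$ I regress directly against the sorted observations, which amounts to the same thing (the `tail_fraction` cutoff is an arbitrary choice):

```python
import numpy as np

def tail_index_from_rank_plot(samples, tail_fraction=0.1):
    """Estimate s by regressing log Q on log x over the largest
    observations: the slope should be about -s."""
    x = np.sort(samples)
    n = len(x)
    q = 1.0 - np.arange(1, n + 1) / n    # proportion of obs. larger than x_(i)
    keep = slice(int(n * (1 - tail_fraction)), n - 1)  # tail only; drop q = 0
    slope, _ = np.polyfit(np.log(x[keep]), np.log(q[keep]), 1)
    return -slope
```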
If we suspect that the $X_i$ were generated from a specific parametric family of densities (Pareto, Weibull, etc.), we can try to estimate the tail index using methods from classical parametric statistics, like Maximum Likelihood. Suppose we know that the $X_i$ come from a Pareto distribution, whose density is $f_s(x) = s x^{-s-1}$ over $[1, \infty[$. The log-likelihood of the observations is then
$$\ell(s) = \sum_{i=1}^{n} \big( \ln(s) - (s+1)\ln(x_i) \big),$$
which is maximized when $n/s = \sum_i \ln(x_i)$. Consequently, the MLE estimator for $s$ is:
$$\hat{s} = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\ln(x_i)}.$$
If the $X_i$ were truly generated by a Pareto model, this estimator has good properties: it is consistent and asymptotically Gaussian; indeed, $\sqrt{n}(\hat{s} - s)$ is approximately $\mathcal{N}(0, s^2)$ when $n$ is large. This opens the door to statistical significance tests.
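In code, the Pareto MLE is a one-liner; here is a minimal sketch with a sanity check on synthetic Pareto data (the sample size, seed and true index below are arbitrary choices):

```python
import numpy as np

def pareto_mle(samples):
    """MLE of the tail index for a Pareto(s) sample on [1, +inf)."""
    return 1.0 / np.mean(np.log(samples))

rng = np.random.default_rng(0)                    # arbitrary seed
s_true = 1.5
x = rng.uniform(size=100_000) ** (-1.0 / s_true)  # inverse-cdf Pareto sampling
print(pareto_mle(x))                              # should be close to 1.5
```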
For the three datasets shown earlier, the MLE estimates are:
| Dataset 1 | Dataset 2 | Dataset 3 |
| --- | --- | --- |
| 1.7 | 0.43 | 1.1 |
Up to now, with three different estimators, we have obtained very different estimates for $s$…
But what if my data are not Pareto-distributed? A possible approach goes as follows. Suppose that the $X_i$ follow some heavy-tailed distribution with density $f$, and that $f(x) \sim c x^{-s-1}$ for some constant $c$. Then it means that, for large $x$, the density $f$ is almost the same as that of a Pareto. Hence, we could simply forget about the $X_i$ which are too small and only keep the large ones, above some threshold $M$ (after all, they are the ones responsible for the tail behaviour), then apply the MLE estimator. This is called censoring; typically, if $N$ is the number of observations above the chosen threshold, then (since, conditionally on $X > M$, the ratio $X/M$ is approximately Pareto-distributed)
$$\hat{s}_M = \frac{1}{\frac{1}{N}\sum_{i : X_i > M} \ln(X_i/M)}. \tag{4}$$
This censored estimator has two drawbacks:
- It is biased and/or not consistent for many models[1].
- It is extremely sensitive to the choice of the threshold $M$, as the sketch below illustrates.
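As an illustration of this sensitivity, here is a minimal sketch of the censored estimator (4) applied with several thresholds; the test data, thresholds and helper name are my own arbitrary choices (the data are absolute values of Cauchy samples, whose true index is $s = 1$):

```python
import numpy as np

def censored_mle(samples, M):
    """Censored Pareto MLE (4): keep only the observations above M.
    Assumes at least one observation exceeds M."""
    tail = samples[samples > M]
    return 1.0 / np.mean(np.log(tail / M))

rng = np.random.default_rng(1)              # arbitrary seed
x = np.abs(rng.standard_cauchy(100_000))    # heavy tail with s = 1
for M in [1.0, 5.0, 20.0, 100.0]:
    # watch how the estimate moves with M
    print(f"M = {M:6.1f}   s_hat = {censored_mle(x, M):.3f}")
```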
To avoid these pitfalls, the idea is to play with various levels of censoring $M$; that is, to choose a threshold $M_n$ which grows larger with $n$. This leads to the Hill estimator.
What would be the best choice for $M$ in (4)? First, we note that $\hat{s}_M$ undergoes discontinuous changes when $M$ becomes equal to one of the $X_i$, since in this case one term is added to the sum and the number of indices such that $X_k > M$ is incremented by 1. It is thus natural to replace $M$ with the largest samples, say $X_{(1)}, X_{(2)}, \dots$, where $X_{(1)} > \dots > X_{(n)}$ are the ordered samples:
$$\hat{s}_k = \frac{1}{\frac{1}{k}\sum_{i=1}^{k} \ln(X_{(i)}/X_{(k)})}. \tag{5}$$
Here, $k$ is the number of $X_i$ larger than or equal to $X_{(k)}$, so (5) fits the definition of the censored estimator (4) with $M = X_{(k)}$. The crucial difference with (4) is that the threshold is now random.
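In code, a direct transcription of (5) might look as follows (a minimal sketch; `hill_estimator` is a hypothetical name, and ties among the samples are ignored):

```python
import numpy as np

def hill_estimator(samples, k):
    """Hill's estimator (5), built from the k largest observations."""
    x = np.sort(samples)[::-1]                      # X_(1) > ... > X_(n)
    return 1.0 / np.mean(np.log(x[:k] / x[k - 1]))  # X_(k) is the threshold
```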
We still have to explain how to choose k. It turns out that many choices are possible, and they depend on n, as stated by the following essential theorem.
**Hill's estimator is consistent.** Set $k = k_n$, and suppose that $k_n/n \to 0$ and $k_n \to \infty$. If the $X_i$ are iid samples from a heavy-tailed distribution with index $s$ in the general sense of (1), then $\hat{s}_{k_n}$ converges in probability towards $s$.
**Proof in the case of Pareto random variables.** If the $X_i$ have the Pareto density $\mathbf{1}_{[1,\infty[}\, s x^{-s-1}$, then it is easy to check that the $E_i = \ln(X_i)$ are Exponential random variables with rate $s$. The distribution of the ordered tuple $(\ln(X_{(1)}), \dots, \ln(X_{(n)}))$ is thus the distribution of $(E_{(1)}, \dots, E_{(n)})$. This distribution has a very nice representation due to the exponential density. Indeed, for any test function $\varphi$, substituting the spacings $y_i = x_i - x_{i+1}$ (whose joint density factorizes),
$$\mathbf{E}\big[\varphi(E_{(1)}, \dots, E_{(n)})\big] = n! \int_{x_1 > \dots > x_n > 0} \varphi(x_1, \dots, x_n)\, s^n e^{-s(x_1 + \dots + x_n)}\, \mathrm{d}x = \mathbf{E}\big[\varphi(F_1 + \dots + F_n,\; \dots,\; F_{n-1} + F_n,\; F_n)\big],$$
where the $F_i$ are independent exponential random variables, $F_i \sim \mathcal{E}(si)$. Moreover, if $E \sim \mathcal{E}(s)$ then $E/\lambda \sim \mathcal{E}(s\lambda)$: since the $E_i$ are all iid with law $\mathcal{E}(s)$, we thus have $F_i \overset{\mathrm{law}}{=} E_i/i$. In the end, we proved that
$$(E_{(1)}, \dots, E_{(n)}) \overset{\mathrm{law}}{=} \left( \sum_{i=1}^{n} \frac{E_i}{i},\; \dots,\; \frac{E_{n-1}}{n-1} + \frac{E_n}{n},\; \frac{E_n}{n} \right),$$
in other words, $E_{(i)}$ has the same distribution as $\sum_{j=i}^{n} E_j/j$. As a consequence,
$$\hat{s}_k^{-1} = \frac{1}{k} \sum_{i=1}^{k} \big( E_{(i)} - E_{(k)} \big) \overset{\mathrm{law}}{=} \frac{1}{k} \sum_{i=1}^{k} \sum_{j=i}^{k-1} \frac{E_j}{j} = \frac{1}{k} \sum_{j=1}^{k-1} E_j.$$
As long as $k = k_n \to \infty$, this is (up to the factor $(k-1)/k \to 1$) the empirical mean of iid $\mathcal{E}(s)$ random variables, and it converges towards their common mean $1/s$, as requested. Note that with this representation, we can also infer a CLT.
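For instance, still in the Pareto case, the classical CLT applied to this sum of iid $\mathcal{E}(s)$ variables, combined with the delta method, suggests the following (a sketch of the statement, not proved here):
$$\sqrt{k}\left(\hat{s}_k^{-1} - \frac{1}{s}\right) \xrightarrow{d} \mathcal{N}\left(0, \frac{1}{s^2}\right), \qquad\text{hence}\qquad \sqrt{k}\left(\hat{s}_k - s\right) \xrightarrow{d} \mathcal{N}\left(0, s^2\right).$$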
**Proof in the general case.** It is roughly the same idea, but with more technicalities coming from the properties of slowly varying functions. To give the sketch: if $F$ is the cdf of the law of the $X_i$, then we can write $X_i = F^{-1}(U_i)$ where the $U_i$ are uniform on $[0,1]$. It turns out that if $F$ satisfies (1), then its inverse can be written as $F^{-1}(u) = (1-u)^{-1/s}\, \ell(1/(1-u))$ with $\ell$ slowly varying; this is a difficult result due to Karamata. Then, since $1 - U_i$ has the same law as $U_i$, $\log F^{-1}(U_i)$ can be represented (in law) as
$$s^{-1} \log(1/U_i) + \log \ell(1/U_i).$$
The first term is precisely an $\mathcal{E}(s)$ random variable, hence the analysis of the Pareto case holds. Then, we must show that the second term, when summed over $i = 1, \dots, k$, does not contribute. This is again a consequence of slow variation.
Typically, $k_n$ could be $\lfloor \ln n \rfloor$ or $\lfloor \sqrt{n} \rfloor$ or even $\lfloor (\log\log\log n)^2 \rfloor$. That leaves many possible choices. In practice, instead of picking one such $k$, statisticians prefer plotting $\hat{s}_k$ for all possible $k$ and observing the general shape of the plot of $k \mapsto \hat{s}_k$, which is called the Hill plot. Of course, Hill's theorem above says that most of the information on $s$ is contained in a small region at the beginning of the plot, a region where $k$ is small with respect to $n$ (i.e. $k/n$ close to 0) but not too small (i.e. $k$ still large). It often happens that the Hill plot is almost constant on this small region, which can be seen for example in the second plot.
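Here is one way to compute a whole Hill plot at once, using cumulative means of the ordered log-samples (a sketch; the test data, seed and `k_max` are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def hill_plot_values(samples, k_max):
    """All Hill estimates s_hat_k for k = 2, ..., k_max in one pass."""
    x = np.sort(samples)[::-1]         # descending order statistics
    logs = np.log(x[:k_max])
    ks = np.arange(1, k_max + 1)
    inv = np.cumsum(logs) / ks - logs  # (1/k) sum_i ln X_(i)  -  ln X_(k)
    return ks[1:], 1.0 / inv[1:]       # k = 1 gives 0, skip it

# on |Cauchy| samples the plot should stabilize around s = 1
rng = np.random.default_rng(2)
ks, s_hat = hill_plot_values(np.abs(rng.standard_cauchy(100_000)), k_max=5_000)
plt.plot(ks, s_hat)
plt.xlabel("k")
plt.ylabel(r"$\hat{s}_k$")
plt.show()
```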
There's a whole science to reading Hill plots.
It turns out that the three datasets I created for this post are as follows (a generation sketch follows the list):
- the first one is an exponential distribution with parameter 1. It has a light tail;
- the second one is a Pareto distribution with index $s = 0.5$;
- the third one is (the absolute value of) a Cauchy distribution, therefore its index is $s = 1$.
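For the record, such datasets can be regenerated along these lines (a sketch: the sample size and seed are my guesses, not the ones actually used for the plots above):

```python
import numpy as np

rng = np.random.default_rng(42)    # arbitrary seed
n = 10_000                         # guessed sample size

data_exp    = rng.exponential(scale=1.0, size=n)   # light tail
data_pareto = rng.uniform(size=n) ** (-1.0 / 0.5)  # Pareto with s = 0.5
data_cauchy = np.abs(rng.standard_cauchy(size=n))  # |Cauchy|, s = 1
```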
[1] It is easy to see that if the $X_i$ come from a mixture of two Pareto distributions with different parameters $s_1, s_2$, then the censored MLE estimator will not converge towards the true tail index, which is $\min\{s_1, s_2\}$.