Sampling refers to the generation of random variables following a certain probability distribution; for example, if $\pi$ is a probability measure on $\mathbb{R}^d$, we want to generate random variables $X_1, X_2, \dots$ which are independent and follow the distribution given by $\pi$, or we want to compute expectations like $\mathbb{E}[\varphi(X)] = \int \varphi\,\mathrm{d}\pi$ for some function $\varphi$, where $X \sim \pi$. In many cases, one does not fully know $\pi$, but only that it is proportional to some function $f$, that is $\pi(\mathrm{d}x) = Z^{-1}f(x)\,\mathrm{d}x$ where $Z = \int f(x)\,\mathrm{d}x$, and computing $Z$ (the normalizing constant) is intractable.
There are many techniques that still allow one to sample from $\pi$ in this case; here I focus on importance sampling (IS), also called reweighting, and I prove an insightful result by Chatterjee and Diaconis on the number of samples required for IS to be sufficiently precise.
Let $\mu$ be a probability measure, easy to sample from, with density $q$. Let $w = f/q$ be the weight function, and let $X \sim \mu$ and $Y \sim \pi$ (in the sequel, $X$ will always have density $q$ and $Y$ density $Z^{-1}f$). Then, for any function $\varphi$,
$$\mathbb{E}[\varphi(X)\,w(X)] = \int \varphi(x)\,\frac{f(x)}{q(x)}\,q(x)\,\mathrm{d}x = \int \varphi(x)\,f(x)\,\mathrm{d}x = Z\,\mathbb{E}[\varphi(Y)]. \tag{1}$$
Applying this formula to $\varphi \equiv 1$ also yields $\mathbb{E}[w(X)] = Z$. Consequently,
$$\mathbb{E}[\varphi(Y)] = \frac{\mathbb{E}[\varphi(X)\,w(X)]}{\mathbb{E}[w(X)]}. \tag{2}$$
This suggests the following method for approximating $\mathbb{E}[\varphi(Y)]$ without ever computing $Z$.

Proposition. Let $X_1, \dots, X_n$ be iid with law $\mu$. Then, when $n \to \infty$, almost surely one has
$$\frac{1}{n}\sum_{i=1}^n \varphi(X_i)\,w(X_i) \longrightarrow Z\,\mathbb{E}[\varphi(Y)] \tag{3}$$
and
$$\frac{\sum_{i=1}^n \varphi(X_i)\,w(X_i)}{\sum_{i=1}^n w(X_i)} \longrightarrow \mathbb{E}[\varphi(Y)]. \tag{4}$$
Proof. By the Law of Large Numbers, the LHS of (3) converges almost surely towards $\mathbb{E}[\varphi(X)\,w(X)]$, which by (1) equals $Z\,\mathbb{E}[\varphi(Y)]$.
Also by the LLN, $\frac{1}{n}\sum_{i=1}^n w(X_i)$ converges towards $\mathbb{E}[w(X)] = Z$.
Take the ratio of the two to get (4).
This technique explains the name importance sampling: the samples $X_i$ are iid but their density is $q$, not $Z^{-1}f$, and the weights $w(X_i)$ correct the difference between the two. If some $X_i$ is very likely under $\mu$ but not so much under $\pi$ (meaning that $f(X_i)$ is close to zero but $q(X_i)$ is high), then this sample will be assigned a very small weight.
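Here is a minimal sketch of the method in Python (the helper and its names are mine, not a standard API), implementing the ratio estimator of (4) on a toy example where the answer is known.

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_sampling(phi, sample_mu, f, q, n):
    """Self-normalized IS estimate of E_pi[phi(Y)], where pi has unnormalized
    density f and the proposal mu has density q."""
    x = sample_mu(n)              # X_1, ..., X_n iid with law mu
    w = f(x) / q(x)               # weights w(X_i) = f(X_i) / q(X_i)
    return np.sum(phi(x) * w) / np.sum(w)

# Toy example: target pi = N(1, 1), "known" only up to its normalizing constant,
# proposal mu = N(0, 1); the true value of E_pi[Y] is 1.
f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)               # unnormalized target density
q = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)  # proposal density
estimate = importance_sampling(lambda x: x, lambda m: rng.normal(0.0, 1.0, m), f, q, 10_000)
print(estimate)   # should be close to 1
```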
Unless $\mu$ and $\pi$ are already very close to each other, this can be pretty bad. Let's look at the variance of the estimator given in (3): its terms $\varphi(X_i)\,w(X_i)$ are iid with mean $Z\,\mathbb{E}[\varphi(Y)]$, hence
$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n \varphi(X_i)\,w(X_i)\right) = \frac{\mathbb{E}[\varphi(X)^2 w(X)^2] - Z^2\,\mathbb{E}[\varphi(Y)]^2}{n}.$$
The first term could be prohibitively big. Indeed, let us consider the following situation: our prior density is a standard Gaussian, $q(x) = e^{-x^2/2}/\sqrt{2\pi}$, and our target density is the same, but shifted by 10: $f(x) = e^{-(x-10)^2/2}/\sqrt{2\pi}$. Here both densities are normalized, so $Z = 1$. But, already for $\varphi \equiv 1$,
$$\mathbb{E}[w(X)^2] = \int \frac{f(x)^2}{q(x)}\,\mathrm{d}x = e^{100} \approx 2.7 \times 10^{43}.$$
That is way too big: getting a low variance for the estimator in (3) would require more than $10^{43}$ samples, which is impossible.
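A quick simulation (my own illustration) makes the problem visible: with the shifted target, a single sample typically carries almost all of the weight, and the empirical second moment of the weights has nothing to do with its true value $e^{100}$.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100_000
x = rng.normal(0.0, 1.0, n)              # X_1, ..., X_n from the proposal N(0, 1)
log_w = 10.0 * x - 50.0                  # log w(X_i) for the N(10, 1) target (Z = 1)

# A single sample typically carries almost all of the total weight:
w = np.exp(log_w - log_w.max())          # rescaled weights (the ratio below is unchanged)
print("largest weight / total weight:", w.max() / w.sum())

# The empirical second moment of the weights completely misses the true value
# E_mu[w^2] = exp(100) ~ 2.7e43, because the samples that dominate it are never drawn.
print("empirical estimate of E[w^2]:", np.mean(np.exp(2.0 * log_w)))
```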
Practically, a good indicator of the quality of our importance sampling is given by the effective sample size (ESS). Suppose you used $n$ samples $X_1, \dots, X_n$, with weights $w_i = w(X_i)$. If they were distributed exactly as $\pi$, that is if $q$ were proportional to $f$, the weights $w_i$ would be constant. In general $q$ is not proportional to $f$, so that is not the case; one way to measure how far our samples are from genuine samples of $\pi$ is to set $\mathrm{cv}^2$ equal to the squared empirical coefficient of variation of the $w_i$, and
$$\mathrm{ESS} = \frac{n}{1 + \mathrm{cv}^2}.$$
Suppose that you use IS with a sample size of $n$. If $\mathrm{ESS} \approx n$, then the weights are almost constant and there is a good chance that the sampling is excellent. If it is small, say $\mathrm{ESS} \approx 100$, then it means that the quality of your IS estimator is roughly the same as if you had used 100 samples from the real distribution $\pi$.
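Computing the ESS from the weights is a one-liner (again a small hypothetical helper, assuming NumPy):

```python
import numpy as np

def effective_sample_size(w):
    """Kish effective sample size n / (1 + cv^2), where cv is the empirical
    coefficient of variation of the weights; with the 1/n variance convention
    this equals (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    cv2 = np.var(w) / np.mean(w) ** 2     # np.var uses the 1/n convention by default
    return len(w) / (1.0 + cv2)

print(effective_sample_size(np.ones(1000)))          # constant weights: ESS = n = 1000
print(effective_sample_size([1.0] + [1e-8] * 999))   # one dominant weight: ESS close to 1
```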
This is a good rule of thumb to assess the quality of IS, but it remains quite empirical. In general, how many samples are required when one wants to efficiently estimate a mean using importance sampling?
Let
$$D = \mathrm{KL}(\pi \,\|\, \mu) = \int \log\left(\frac{\mathrm{d}\pi}{\mathrm{d}\mu}\right)\mathrm{d}\pi$$
be the Kullback-Leibler divergence between the target $\pi$ and the proposal $\mu$ (it is supposed to be finite). It can also be written as $D = \mathbb{E}[\log\rho(Y)]$, where $\rho = \mathrm{d}\pi/\mathrm{d}\mu = w/Z$ and $Y \sim \pi$. In general, concentration inequalities allow one to bound how much $\log\rho(Y)$ is concentrated around its mean $D$: depending on $t$, the deviation probability
$$\varepsilon(t) = \mathbb{P}\big(\,|\log\rho(Y) - D| > t\,\big)$$
should be a decreasing function of $t$, with $\varepsilon(t) \to 0$ as $t \to \infty$. The error terms in the sequel will be expressed using this function $\varepsilon$. Roughly speaking, $\varepsilon(t)$ is very small when $t$ is large. Note that I chose the two-sided deviation probability, mostly for simplicity.
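For instance, in the shifted Gaussian example from above, $\log\rho(x) = 10x - 50$, so that under $Y \sim \pi = \mathcal{N}(10, 1)$ the log-weight $\log\rho(Y)$ is distributed as $\mathcal{N}(50, 100)$. In that example $D = 50$, the deviation function is explicitly $\varepsilon(t) = \mathbb{P}(|\mathcal{N}(0, 100)| > t) = 2\Phi(-t/10)$ with $\Phi$ the standard Gaussian CDF, and $e^{D} = e^{50} \approx 5 \times 10^{21}$, a first hint at the sample sizes that will appear below.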
For any function $\varphi$, we will write $\hat{I}_n(\varphi)$ for the self-normalized estimator appearing in (4), that is
$$\hat{I}_n(\varphi) = \frac{\sum_{i=1}^n \varphi(X_i)\,w(X_i)}{\sum_{i=1}^n w(X_i)}.$$
We also write $\pi(\varphi) = \int \varphi\,\mathrm{d}\pi$ for the integral and $\|\varphi\|_{L^2(\pi)} = \sqrt{\pi(\varphi^2)}$ for the $L^2$-norm.
Theorem (Chatterjee and Diaconis, 2015).
Let $n$ be any positive integer and $t \ge 0$.
Positive part. Suppose that $n \ge e^{D + t}$. Then, for any $\varphi$,
$$\mathbb{P}\left(\big|\hat{I}_n(\varphi) - \pi(\varphi)\big| \ge \frac{2\,\|\varphi\|_{L^2(\pi)}\sqrt{\delta(t)}}{1 - \sqrt{\delta(t)}}\right) \le 2\sqrt{\delta(t)}, \qquad \text{where } \delta(t) = e^{-t/4} + 2\sqrt{\varepsilon(t/2)}.$$
Negative part. Conversely, if $n \le e^{D - t}$, there is a function $\varphi$ such that $\|\varphi\|_{L^2(\pi)} = 1$ and $\pi(\varphi) \ge \sqrt{1 - \varepsilon(t/2)}$, but
$$\mathbb{P}\big(\hat{I}_n(\varphi) = 0\big) \ge 1 - e^{-t/2}.$$
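To get a feeling for the threshold $e^{D}$, here is a quick numerical check (my own experiment, not from the paper): with $\mu = \mathcal{N}(0,1)$ and $\pi = \mathcal{N}(4,1)$ we have $D = 8$, so roughly $e^{8} \approx 3000$ samples should be needed, and indeed the error of $\hat{I}_n(\varphi)$ for $\varphi(y) = y$ only becomes small once $n$ passes that scale.

```python
import numpy as np

rng = np.random.default_rng(2)

shift = 4.0        # proposal mu = N(0, 1), target pi = N(shift, 1), D = shift^2 / 2 = 8

def is_estimate(n):
    """Self-normalized IS estimate of E_pi[Y] = shift from n proposal samples."""
    x = rng.normal(0.0, 1.0, n)
    w = np.exp(shift * x - shift ** 2 / 2)    # w = dpi/dmu (both densities normalized)
    return np.sum(x * w) / np.sum(w)

for n in [100, 3_000, 100_000]:               # below, around, and above e^D ~ 3000
    errors = [abs(is_estimate(n) - shift) for _ in range(50)]
    print(f"n = {n:6d}   mean absolute error ~ {np.mean(errors):.2f}")
```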
The proof almost entirely relies on the following lemma, in which we have set $I_n(\psi) = \frac{1}{n}\sum_{i=1}^n \psi(X_i)\,\rho(X_i)$ for any function $\psi$; note that the unknown constant $Z$ cancels in the ratio, so that $\hat{I}_n(\varphi) = I_n(\varphi)/I_n(\mathbf{1})$.

Lemma. For every $a > 0$ and every function $\psi$,
$$\mathbb{E}\big|I_n(\psi) - \pi(\psi)\big| \le \|\psi\|_{L^2(\pi)}\left(\sqrt{\frac{a}{n}} + 2\sqrt{\pi(\rho > a)}\right).$$
As noted earlier we have $D = \mathbb{E}[\log\rho(Y)]$, hence it is natural to choose $a$ at the same scale as $e^{D}$. We take $a = e^{D + t/2}$. The ratio $a/n$ becomes smaller than $e^{-t/2}$ as soon as $n \ge e^{D+t}$, and $\pi(\rho > a) = \pi(\log\rho > D + t/2) \le \varepsilon(t/2)$. Overall, the RHS of the Lemma becomes smaller than $\|\psi\|_{L^2(\pi)}\,\delta(t)$, where $\delta(t) = e^{-t/4} + 2\sqrt{\varepsilon(t/2)}$ is the error term of the theorem.
We will prove this lemma later; for now, let us take it for granted and prove the theorem.
Proof of the positive part. We can assume $\delta(t) < 1$, otherwise the bound is trivial. Apply Markov's inequality and the Lemma to the constant function $\mathbf{1}$ and to the function $\varphi$: for any $s, u > 0$,
$$\mathbb{P}\big(|I_n(\mathbf{1}) - 1| \ge s\big) \le \frac{\delta(t)}{s} \qquad \text{and} \qquad \mathbb{P}\big(|I_n(\varphi) - \pi(\varphi)| \ge u\big) \le \frac{\|\varphi\|_{L^2(\pi)}\,\delta(t)}{u}.$$
If $|I_n(\mathbf{1}) - 1| \le s$ and $|I_n(\varphi) - \pi(\varphi)| \le u$, then we see that
$$\big|\hat{I}_n(\varphi) - \pi(\varphi)\big| = \frac{|I_n(\varphi) - \pi(\varphi)\,I_n(\mathbf{1})|}{I_n(\mathbf{1})} \le \frac{u + |\pi(\varphi)|\,s}{1 - s}.$$
Now choose $s = \sqrt{\delta(t)}$ and $u = \|\varphi\|_{L^2(\pi)}\sqrt{\delta(t)}$. The probability of having at least one of the two deviation events above is smaller than $2\sqrt{\delta(t)}$ by the union bound. Moreover, since $|\pi(\varphi)| \le \|\varphi\|_{L^2(\pi)}$ by the Cauchy-Schwarz inequality, we get that outside an event of probability at most $2\sqrt{\delta(t)}$,
$$\big|\hat{I}_n(\varphi) - \pi(\varphi)\big| \le \frac{2\,\|\varphi\|_{L^2(\pi)}\sqrt{\delta(t)}}{1 - \sqrt{\delta(t)}},$$
and the result holds.
Proof of the negative part. Keep the notation of the Lemma and set $a = e^{D - t/2}$, so that $n/a \le e^{-t/2}$. By the Markov inequality, $\mathbb{P}(\rho(X) > a) \le \mathbb{E}[\rho(X)]/a = 1/a$. If every $X_i$ satisfies $\rho(X_i) \le a$, then clearly $\hat{I}_n(\varphi) = 0$ for any function $\varphi$ vanishing on the set $\{\rho \le a\}$. By the union bound we thus have
$$\mathbb{P}\big(\exists\, i \le n : \rho(X_i) > a\big) \le \frac{n}{a} \le e^{-t/2}.$$
It only remains to take $\varphi = \mathbf{1}_{\rho > a}/\sqrt{\pi(\rho > a)}$: it vanishes on $\{\rho \le a\}$, its $L^2(\pi)$-norm is $1$, and $\pi(\varphi) = \sqrt{\pi(\rho > a)} \ge \sqrt{1 - \varepsilon(t/2)}$, since $\pi(\rho \le a) = \pi(\log\rho \le D - t/2) \le \varepsilon(t/2)$.
Proof of the Lemma. Set $\psi_a = \psi\,\mathbf{1}_{\rho \le a}$. Then,
$$\big|I_n(\psi) - \pi(\psi)\big| \le \big|I_n(\psi_a) - \pi(\psi_a)\big| + \big|I_n(\psi - \psi_a)\big| + \big|\pi(\psi - \psi_a)\big|.$$
– The last term is equal to $|\pi(\psi\,\mathbf{1}_{\rho > a})|$. By the Cauchy-Schwarz inequality it is smaller than $\|\psi\|_{L^2(\pi)}\sqrt{\pi(\rho > a)}$.
– The middle term is smaller than $I_n(|\psi|\,\mathbf{1}_{\rho > a})$, hence its expectation is smaller than $\mathbb{E}[\,|\psi(X)|\,\rho(X)\,\mathbf{1}_{\rho(X) > a}] = \pi(|\psi|\,\mathbf{1}_{\rho > a})$, and we use the same Cauchy-Schwarz trick as above.
– The first term is bounded as follows: its expectation is at most $\sqrt{\mathrm{Var}(I_n(\psi_a))}$, and
$$\mathrm{Var}\big(I_n(\psi_a)\big) \le \frac{\mathbb{E}[\psi_a(X)^2\rho(X)^2]}{n} = \frac{\mathbb{E}[\psi_a(Y)^2\rho(Y)]}{n} \le \frac{a\,\|\psi\|_{L^2(\pi)}^2}{n}.$$
We gather the three bounds and get the inequality of the Lemma.
The paper by Chatterjee and Diaconis, with a very nice application to an old remark by Knuth on self-avoiding walks.
A nice paper on the Effective Sample Size by Jun S. Liu, with a theoretical justification of why the ESS is significant and a comparison with other sampling techniques.