Sampling refers to the generation of random variables following a certain probability distribution; for example, if ρ is a probability measure on R, we want to generate random variables which are independent and follow the distribution given by ρ, or we want to compute expectations like E[φ(Y)] for some function φ, where Y∼ρ. In many cases, one does not fully know ρ, but only knows that its density is proportional to some function f, that is ρ(x)=f(x)/Z where Z=∫f(u)du, and computing Z (the normalizing constant) is intractable.
There are many techniques that still allow one to sample from ρ in this case; here I focus on importance sampling (IS), also called reweighting, and I prove an insightful result by Chatterjee and Diaconis on the number of samples required for IS to be sufficiently precise.
Let ν be a probability measure, easy to sample from, with density g. Let Y∼ρ and X∼ν (in the sequel, the Yi will always be distributed according to ρ and the Xi according to ν). Then, for any function φ,
$$\mathbb{E}\left[\varphi(X)\,\frac{f(X)}{g(X)}\right] = \int \varphi(x)\,\frac{f(x)}{g(x)}\,g(x)\,\mathrm{d}x = \int \varphi(x) f(x)\,\mathrm{d}x = Z\,\mathbb{E}[\varphi(Y)].$$
Applying this formula to φ≡1 also yields E[f(X)/g(X)]=Z. Consequently,
$$\frac{\mathbb{E}[\varphi(X)f(X)/g(X)]}{\mathbb{E}[f(X)/g(X)]} = \mathbb{E}[\varphi(Y)].$$
This suggests the following method for approximating E[φ(Y)] without computing Z.
Let X1,…,Xn be iid with law ν. Set w(x) = f(x)/g(x). Then, when n → ∞, almost surely one has
$$\frac{1}{n}\sum_{i=1}^n w(X_i) \longrightarrow Z \qquad\text{and}\qquad \frac{\sum_{i=1}^n w(X_i)\,\varphi(X_i)}{\sum_{i=1}^n w(X_i)} \longrightarrow \mathbb{E}[\varphi(Y)] = \int \varphi(x)\,\frac{f(x)}{Z}\,\mathrm{d}x.$$
Proof. By the Law of Large Numbers, the first average n^{-1}∑_{i=1}^n w(Xi) converges towards E[w(X)] = ∫w(x)g(x)dx = ∫f(x)dx = Z.
Also by the LLN, n^{-1}∑ w(Xi)φ(Xi) converges towards E[φ(X)w(X)] = Z E[φ(Y)]. Taking the ratio of these two limits gives the second convergence.
This explains the name importance sampling: the samples Xi are iid, but their density is g(x), not f(x)/Z, and the weights correct for the difference between the two. If some x is very likely under g but not so much under f (meaning that f(x)/Z is close to zero while g(x) is large), then this sample is assigned a very small weight.
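To make this concrete, here is a minimal Python sketch of the whole procedure. Everything in it (the unnormalized target f, the proposal g, the test function φ, the sample size) is an arbitrary choice of mine for illustration, not something imposed by the method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target f: a N(3, 1) density times an arbitrary constant,
# so that the normalizing constant Z is "unknown". Proposal g: density of N(0, 2^2).
f = lambda x: 7.3 * np.exp(-0.5 * (x - 3.0) ** 2)
g = lambda x: np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

n = 100_000
X = rng.normal(0.0, 2.0, size=n)       # X_1, ..., X_n iid with law nu
w = f(X) / g(X)                        # importance weights w(X_i) = f(X_i) / g(X_i)

Z_hat = w.mean()                       # estimates Z = \int f  (= 7.3 * sqrt(2*pi) ≈ 18.3)
phi = lambda x: x ** 2
J_n = np.sum(w * phi(X)) / np.sum(w)   # self-normalized estimate of E[phi(Y)]  (= 10 here)

print(Z_hat, J_n)
```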
Unless ρ and ν are already very close to each other, this can be pretty bad. Let's look at the variance of the estimator Ẑ = n^{-1}∑_{i=1}^n w(Xi) from the first limit above: clearly E[Ẑ] = Z, hence
$$\mathrm{Var}(\hat Z) = \frac{\mathrm{Var}(w(X))}{n} = \frac{\mathbb{E}[w(X)^2]}{n} - \frac{Z^2}{n}.$$
The first term, E[w(X)²]/n, could be prohibitively big. Indeed, let us consider the following situation: our sampling density is a standard Gaussian, g(x) = e^{-x²/2}/√(2π), and our target density is the same, but shifted by 10: f(x) = e^{-(x-10)²/2}/√(2π). Here both densities are normalized, so Z = 1. But
$$\mathbb{E}[w(X)^2] = \int \frac{f(x)^2}{g(x)}\,\mathrm{d}x = \frac{1}{\sqrt{2\pi}}\int e^{-(x-10)^2 + x^2/2}\,\mathrm{d}x = e^{100},$$
so Var(Ẑ) = (e^{100} − 1)/n: the estimator is useless unless n is of the order of e^{100}.
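A quick simulation shows how hopeless this is in practice; this is only a sketch, with a sample size and seed chosen arbitrarily by me:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10**6
X = rng.standard_normal(n)                       # samples from g = N(0, 1)
log_w = -0.5 * (X - 10.0) ** 2 + 0.5 * X ** 2    # log w(x) = log f(x) - log g(x) = 10x - 50
w = np.exp(log_w)

Z_hat = w.mean()          # should estimate Z = 1...
top = w.max() / w.sum()   # fraction of the total weight carried by the single largest weight

print(Z_hat, top)
# Z_hat is typically off by several orders of magnitude, and one or two samples
# carry essentially all of the weight: with Var(Z_hat) = (e^100 - 1)/n, no realistic
# n can help.
```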
Practically, a good indicator of the quality of our importance sampling is given by the effective sample size (ESS). Suppose you used n samples Xi. If they were distributed exactly as ρ, that is if g were proportional to f, the weights would be constant: w(Xi) = Z for all i. In general f is not proportional to g, so this is not the case; as a way to measure how far our weighted samples are from n genuine samples of ρ, we set σ̂n = the empirical coefficient of variation of the w(Xi), and
$$\mathrm{ESS} = \frac{n}{1 + \hat\sigma_n^2}.$$
Suppose that you use IS with a sample size of n = 1000. If ESS ≈ 1000, then the weights are almost constant and there's a good chance that the sampling is excellent. If it is small, say ESS ≈ 100, then the quality of your IS estimator is roughly the same as if you had used 100 samples of the real distribution ρ.
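In code, once the weights have been computed, the ESS is a one-liner. The function name below is mine; the docstring records the equivalent (∑w)²/∑w² form, which is what one usually implements:

```python
import numpy as np

def effective_sample_size(w: np.ndarray) -> float:
    """ESS = n / (1 + cv^2), where cv is the coefficient of variation of the
    weights; with the biased empirical variance this equals (sum w)^2 / sum(w^2)."""
    return w.sum() ** 2 / np.sum(w ** 2)

# The weights of the shifted-Gaussian example above: w(x) = exp(10x - 50).
rng = np.random.default_rng(1)
w = np.exp(10.0 * rng.standard_normal(10**6) - 50.0)
print(effective_sample_size(w))   # close to 1: a single sample does all the work
```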
This method is a good rule of thumb to assess the quality of IS, but it's quite empirical. In general, what is the required number of samples when one wants to efficiently estimate a mean E[φ(Y)] using importance sampling?
From now on, to lighten notation, let us assume that f is already normalized, i.e. Z = 1, so that w(x) = f(x)/g(x) is exactly the likelihood ratio dρ/dν (this is harmless here, since the estimator Jn below does not involve Z). Let
$$D = \int \rho(x)\log\frac{\rho(x)}{g(x)}\,\mathrm{d}x$$
be the Kullback–Leibler divergence from ρ to ν (it is supposed to be finite). It can also be written as D = E[log w(Y)]. In general, concentration inequalities allow one to bound how much log w(Y) concentrates around its mean D: depending on ν and ρ, the deviation probability
$$r(s) = \mathbb{P}\big(|\log w(Y) - D| > s\big)$$
should be a decreasing function of s with r(s) → 0 as s → ∞. The error terms in the sequel will be expressed using this function r; roughly speaking, one should think of r(s) as being very small. Note that I chose the two-sided deviation probability, mostly for simplicity.
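In toy examples where we can also sample from ρ directly, both D = E[log w(Y)] and the tail function r can be estimated by plain Monte Carlo. Here is a minimal sketch for the shifted-Gaussian example (my own choice of sample size and seed; recall that there Z = 1 and log w(y) = 10y − 50):

```python
import numpy as np

rng = np.random.default_rng(2)

Y = rng.normal(10.0, 1.0, size=10**6)    # samples from rho = N(10, 1)
log_w = 10.0 * Y - 50.0                  # log w(Y)

D_hat = log_w.mean()                               # estimates D (here D = 50 exactly)
r = lambda s: np.mean(np.abs(log_w - D_hat) > s)   # empirical two-sided tail r(s)

print(D_hat, r(5.0), np.exp(D_hat))      # e^D ≈ 5e21: the sample size the theorem below asks for
```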
For any function φ, we'll denote by Jn(φ) the self-normalized estimator already used above, that is
$$J_n(\varphi) = \frac{\sum_{i=1}^n w(X_i)\,\varphi(X_i)}{\sum_{i=1}^n w(X_i)}.$$
We also write I(φ) = ∫φ(x)ρ(x)dx = E[φ(Y)] for the integral and ‖φ‖₂² = ∫φ(x)²ρ(x)dx for the squared L²(ρ)-norm.
Theorem (Chatterjee and Diaconis). Let s be any positive number and set
$$\varepsilon(s) = \Big(e^{-s/4} + 2\sqrt{r(s/2)}\Big)^{1/2}.$$
Positive part. Suppose that n = e^{D+s}. Then, for any φ,
$$\mathbb{P}\left(|J_n(\varphi) - \mathbb{E}[\varphi(Y)]| > \frac{2\varepsilon(s)}{1-\varepsilon(s)}\,\|\varphi\|_2\right) \leqslant 2\varepsilon(s).$$
Negative part. Conversely, if n = e^{D-s}, there is a function φ such that E[φ(Y)] ⩽ r(s/2) but
$$\mathbb{P}(J_n(\varphi) = 1) \geqslant 1 - e^{-s/2}.$$
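To get a sense of scale, take the shifted-Gaussian example from the variance computation above: there D = KL(N(10,1) ‖ N(0,1)) = 10²/2 = 50, so the threshold sits around e^{50} ≈ 5·10^{21} samples. That is astronomical, but still far smaller than the e^{100} showing up in the variance of Ẑ, which involves the χ²-type quantity ∫ f²/g rather than the KL divergence.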
The proof almost entirely relies on the following lemma, in which we have set
$$I_n(\varphi) = \frac{1}{nZ}\sum_{i=1}^n \varphi(X_i)\,w(X_i),$$
so that Jn(φ) = In(φ)/In(1) and E[In(φ)] = I(φ).
Lemma. For any λ > 0,
$$\mathbb{E}\big[|I_n(\varphi) - I(\varphi)|\big] \leqslant \|\varphi\|_2\left(\sqrt{\frac{\lambda}{n}} + 2\sqrt{\mathbb{P}(w(Y) > \lambda)}\right).$$
As noted earlier we have E[log w(Y)] = ∫ f(x) log(f(x)/g(x)) dx = D, hence it is natural to choose log λ on the same scale as D. We take λ = e^{D+s/2}. Then √(λ/n) = e^{-s/4} and P(w(Y)>λ) = P(log w(Y) − D > s/2) ⩽ r(s/2). Overall, the RHS of the Lemma becomes smaller than ‖φ‖₂ × ε(s)², where, as in the theorem,
$$\varepsilon(s)^2 = e^{-s/4} + 2\sqrt{r(s/2)}.$$
We will prove this lemma at the end; let us first see how it implies the theorem.
Proof of the positive part. Apply Markov's inequality and the Lemma to the constant function 1 and to the function φ. We get
$$\mathbb{P}(|I_n(1) - 1| > \delta) \leqslant \frac{\varepsilon(s)^2}{\delta} \qquad\text{and}\qquad \mathbb{P}(|I_n(\varphi) - I(\varphi)| > \delta') \leqslant \frac{\|\varphi\|_2\,\varepsilon(s)^2}{\delta'}.$$
If |In(1) − 1| < δ and |In(φ) − I(φ)| < δ', then we see that
$$|J_n(\varphi) - I(\varphi)| = \left|\frac{I_n(\varphi)}{I_n(1)} - I(\varphi)\right| \leqslant \frac{|I_n(\varphi) - I(\varphi)| + |I(\varphi)|\,|I_n(1) - 1|}{|I_n(1)|} \leqslant \frac{\delta' + \delta\,|I(\varphi)|}{1 - \delta}.$$
Now choose δ = ε(s) and δ' = ‖φ‖₂ ε(s). By the union bound, the probability that at least one of the two events above fails is smaller than 2ε(s). Moreover, since |I(φ)| ⩽ ‖φ‖₂ by the Cauchy–Schwarz inequality, we get that
$$\frac{\delta' + \delta\,|I(\varphi)|}{1 - \delta} \leqslant \frac{2\varepsilon(s)}{1 - \varepsilon(s)}\,\|\varphi\|_2,$$
and the result holds.
Proof of the negative part. Take λ = e^{D−s/2} and set φ(x) = 1_{w(x)⩽λ}, so that I(φ) = P(w(Y)⩽λ) = P(log w(Y) − D ⩽ −s/2) ⩽ r(s/2). On the other hand, by the Markov inequality, P(w(X) > λ) ⩽ E[w(X)]/λ = Z/λ = 1/λ (remember that X ∼ ν, so E[w(X)] = ∫f = Z = 1). If every Xi has w(Xi) ⩽ λ, then clearly Jn(φ) = 1. By the union bound we thus have
$$\mathbb{P}(J_n(\varphi) = 1) \geqslant 1 - n\,\mathbb{P}(w(X) > \lambda) \geqslant 1 - \frac{n}{\lambda} = 1 - e^{-s/2}.$$
Proof of the Lemma. Write Ĩn(φ) = (1/(nZ))∑_{i=1}^n φ(Xi) w(Xi) 1_{w(Xi)⩽λ} for the truncated version of In(φ), and decompose
$$I_n(\varphi) - I(\varphi) = \underbrace{\big(\tilde I_n(\varphi) - \mathbb{E}[\tilde I_n(\varphi)]\big)}_{A} + \underbrace{\big(I_n(\varphi) - \tilde I_n(\varphi)\big)}_{B} - \underbrace{\big(I(\varphi) - \mathbb{E}[\tilde I_n(\varphi)]\big)}_{C},$$
so that E[|In(φ) − I(φ)|] ⩽ E[|A|] + E[|B|] + |C|.
– Term A is centered and, since Z = 1, its variance is smaller than n^{-1} E[(φ(X)w(X))² 1_{w(X)⩽λ}] = n^{-1} ∫ φ(x)² ρ(x) w(x) 1_{w(x)⩽λ} dx ⩽ (λ/n) ‖φ‖₂², hence E[|A|] ⩽ ‖φ‖₂ √(λ/n).
– Term C is equal to ∫ ρ(x) φ(x) 1_{w(x)>λ} dx. By the Cauchy–Schwarz inequality it is smaller in absolute value than
$$\sqrt{\int \varphi(x)^2\rho(x)\,\mathrm{d}x}\times\sqrt{\mathbb{P}(w(Y)>\lambda)} = \|\varphi\|_2\,\sqrt{\mathbb{P}(w(Y)>\lambda)}.$$
– Term B is smaller in absolute value than (nZ)^{-1}∑ w(Xi)|φ(Xi)| 1_{w(Xi)>λ}, hence its expectation is smaller than E[w(X)|φ(X)| 1_{w(X)>λ}]/Z = E[|φ(Y)| 1_{w(Y)>λ}], and the same Cauchy–Schwarz trick as for term C bounds this by ‖φ‖₂ √(P(w(Y)>λ)). Summing the three bounds gives the Lemma.
The paper by Chatterjee and Diaconis, “The sample size required in importance sampling” (Annals of Applied Probability, 2018), with a very nice application to an old remark by Knuth on self-avoiding walks.
A nice paper on the effective sample size by Jun S. Liu, with a theoretical justification of why the ESS is meaningful and a comparison with other sampling techniques.