Importance sampling ⚖️

June 2023

Sampling refers to the generation of random variables following a certain probability distribution; for example if ρ\rho is a probability measure on R\mathbb{R}, we want to generate random variables which are independent and follow the distribution given by ρ\rho, or we want to compute expectations like E[φ(Y)]\mathbb{E}[\varphi(Y)] for some function φ\varphi, where YρY \sim \rho. In many cases, one does not fully knows ρ\rho, but only that it is proportional to some function ff, that is ρ(x)=f(x)/Z\rho(x) = f(x)/Z where Z=f(u)duZ = \int f(u)du, and computing ZZ (the normalizing constant) is intractable.

There are many techniques that still allow to sample from ρ\rho in this case; in this note I'm focusing on importance sampling (IS), also called reweighting. In this note I prove an insightful result by Chatterjee and Diaconis on the number of samples required for IS to be sufficiently precise.

The basic idea of Importance sampling

Let ν\nu be probability measure, easy to sample from, with density gg. Let YρY \sim \rho and XνX\sim \nu (in the sequel, YiY_i will always have density ρ\rho and XiX_i density gg). Then, for any function φ\varphi,

E[φ(X)f(X)/g(X)]=φ(x)f(x)g(x)g(x)dx=φ(x)f(x)dx=ZE[φ(Y)]. \mathbb{E}[\varphi(X)f(X)/g(X)] = \int \varphi(x)\frac{f(x)}{g(x)}g(x)dx = \int \varphi(x)f(x)dx = Z \mathbb{E}[\varphi(Y)].

Applying this formula to φ1\varphi \equiv 1 also yields E[f(X)/g(X)]=Z\mathbb{E}[f(X)/g(X)] =Z. Consequently,

E[φ(X)f(X)/g(X)]E[f(X)/g(X)]=E[φ(Y)].\frac{ \mathbb{E}[\varphi(X)f(X)/g(X)] }{ \mathbb{E}[f(X)/g(X)] } = \mathbb{E}[\varphi(Y)].

This suggests the following method for approximating E[φ(Y)]\mathbb{E}[\varphi(Y)] without computing ZZ.

Let X1,,XnX_1, \dotsc, X_n be iid with law ν\nu. Set w(x)=f(x)/g(x)w(x) = f(x)/g(x). Then, when nn\to\infty, almost surely one has i=1nw(Xi)nZ \frac{\sum_{i=1}^n w(X_i)}{n} \to Z and i=1nw(Xi)φ(Xi)i=1nw(Xi)E[φ(Y)]=Φ(x)f(x)Zdx. \frac{\sum_{i=1}^n w(X_i) \varphi(X_i)}{\sum_{i=1}^n w(X_i)} \to \mathbb{E}[\varphi(Y)]= \int \Phi(x)\frac{f(x)}{Z}dx.

Proof. By the Law of Large Numbers, the LHS of (3) converges towards E[w(X)]=w(x)g(x)dx=f(x)dx=Z\mathbf{E}[w(X)] = \int w(x)g(x)dx = \int f(x)dx = Z.

Also by the LLN, w(Xi)φ(Xi)n\frac{\sum w(X_i)\varphi(X_i)}{n} converges towards E[φ(X)w(X)]=ZE[φ(Y)]\mathbf{E}[\varphi(X)w(X)] = Z\mathbf{E}[\varphi(Y)].

Take the ratio of the two to get (4).

This technique explains the word importance sampling: the samples XiX_i are iid but their density is g(x)g(x), not f(x)/Zf(x)/Z, and the weights correct the difference between the two. If some xx is very likely under gg but not so much under ff (meaning that f(x)/Zf(x)/Z is close to zero but g(x)g(x) is high), then this sample will be assigned a very small weight.

How good is this approximation?

Unless ρ\rho and ν\nu are already very close to each other, this can be pretty bad. Let's look at the variance of the estimator Z^\hat{Z} given in (3): clearly E[Z^]=Z\mathbf{E}[\hat{Z}]=Z, hence

Var(Z^)=g(x)f(x)2g(x)2dxZ2n=f(x)f(x)g(x)dxZ2n=ZE[f(Y)/g(Y)]Z2n. \mathrm{Var}(\hat{Z}) = \frac{\int g(x) \frac{f(x)^2}{g(x)^2}dx - Z^2}{n} = \frac{\int f(x) \frac{f(x)}{g(x)}dx - Z^2}{n} = \frac{Z\mathbf{E}[f(Y)/g(Y)]-Z^2}{n}.

The first term could be prohibitively big. Indeed, let us consider the following situation: our prior density is a Standard Gaussian g(x)=ex2/2/2πg(x) = e^{-x^2/2}/\sqrt{2\pi}, and our target density is the same, but shifted by 10: f(x)=e(x10)2/2/2πf(x) = e^{-(x-10)^2/2}/\sqrt{2\pi}. Here both densities are normalized, so Z=1Z=1. But,

E[f(Y)/g(Y)]=12πe(x10)22e(x10)22+x22dx=12πe(x20)22+100dx=e100.\mathbf{E}[f(Y)/g(Y)] = \frac{1}{\sqrt{2\pi}}\int e^{-\frac{(x-10)^2}{2}}e^{-\frac{(x-10)^2 }{ 2} + \frac{x^2}{2}}dx = \frac{1}{\sqrt{2\pi}}\int e^{- \frac{(x-20)^2}{2} + 100}dx = e^{100}.

That is too big. Having a low variance for Z^\hat{Z} would require having more than e100e^{100} samples, which is impossible.

The Effective Sample Size

Practically, a good indicator of the quality of our importance sampling is given by the effective sample size. Suppose you used nn samples XiX_i. If they were distributed exactly as ρ\rho, that is if gg was proportional to ff, we would have w(Xi)=1/Zw(X_i) = 1/Z for all XiX_i. In general ff is not proportional to gg hence that's not the case and one way to measure how far the samples we have are from nn samples of ρ\rho, we set σ^n\hat{\sigma}_n = the empirical coefficient of variation of the w(Xi)w(X_i), and

ESS(n)=n1+σ^n. \mathrm{ESS}(n) = \frac{n}{1 + \hat{\sigma}_n}.

How to use the ESS ?

Suppose that you use IS with a sample size of n=1000n=1000. If ESS1000\mathrm{ESS} \approx 1000 then the weigths are almost constant and there's a good chance that our sampling is excellent. If it's small, say ESS100\mathrm{ESS} \approx 100, then it means that the quality of your IS estimator is the same as if you would have used 100 samples of the real distribution ρ\rho.

This method is a good rule of thumb to assess the quality of IS, but it's quite empirical. In general, what is the required number of samples when one wants to efficiently estimate a mean E[φ(Y)]\mathbf{E}[\varphi(Y)] using importance sampling?

The number of samples required for IS


D=dKL(fg)=log(f(x)/g(x))f(x)dxD = d_{\mathrm{KL}}(f\mid g) = \int \log (f(x)/g(x)) f(x)dx

be the Kullback-Leibler divergence from ρ\rho to ν\nu (it is supposed to be <<\infty). It can also be defined as D=E[logw(Y)]D=\mathbf{E}[\log w(Y)]. In general, concentration inequalities allow to bound how much logw(Y)\log w(Y) is concentrated around its mean DD: depending on ν,μ\nu,\mu, the deviation probability

r(s)=P(logw(Y)D>s)r(s)=\mathbf{P}(|\log w(Y) - D| > s)

should be a decreasing function of ss with r(s)0r(s)\to 0. The error terms in the sequel will be expressed using this function rr. Roughly speaking, r(s)r(s) is very small. Note that I chose the two-sided deviation probability, mostly for simplicity.

For any function φ\varphi we'll note Jn(φ)J_n(\varphi) the estimator on the RHS of (4), that is

Jn(φ)=i=1nw(Xi)φ(Xi)i=1nw(Xi). J_n(\varphi) = \frac{\sum_{i=1}^n w(X_i) \varphi(X_i)}{\sum_{i=1}^n w(X_i)}.

We also note I(φ)=φ(x)ρ(x)dx=E[φ(Y)]I(\varphi) = \int \varphi(x)\rho(x)dx = \mathbf{E}[\varphi(Y)] the integral and φ22=φ(x)2ρ(x)dx|\varphi|^2_2 = \int \varphi(x)^2 \rho(x)dx the L2-norm.

Theorem (Chatterjee and Diaconis, 2015).

Let ss be any positive integer and ε(s)=(es/4+2r(s/2))1/2\varepsilon(s) = (e^{-s/4} + 2\sqrt{r(s/2)})^{1/2}.

Positive part. Suppose that n=eD+sn=e^{D+s}. Then, for any φ\varphi, P(Jn(φ)E[φ(Y)]>ε(s)φ2)2ε(s).\mathbf{P}(|J_n(\varphi) - \mathbf{E}[\varphi(Y)]|>\varepsilon(s) |\varphi|_2)\leqslant 2\varepsilon(s). Negative part. Conversely if n=eDsn=e^{D-s}, there is a function φ\varphi such that E[φ(Y)]r(s/2)\mathbf{E}[\varphi(Y)]\leqslant r(s/2) but

P(Jn(φ)=1)1es/2.\mathbf{P}(J_n(\varphi)=1) \geqslant 1 - e^{-s/2}.

The proof almost entirely relies on the following lemma, in which we have set In(φ)=i=1nφ(Xi)w(Xi)nZI_n(\varphi)= \frac{\sum_{i=1}^n \varphi(X_i)w(X_i)}{nZ}.

Lemma. For any λ>0\lambda >0, E[In(φ)I(φ)]φ2(λn+2P(w(Y)>λ)).\mathbf{E}[|I_n(\varphi) - I(\varphi)|] \leqslant |\varphi|_{2}\left(\sqrt{\frac{\lambda}{n}} + 2\mathbf{P}(w(Y)>\lambda)\right).

As noted earlier we have E[logw(Y)]=f(x)log(f(x)/g(x))dx=D\mathbf{E}[\log w(Y)] = \int f(x)\log(f(x)/g(x))dx = D, hence it is natural to choose logλ\log \lambda at the same scale as DD. We take λ=eD+s/4\lambda = e^{D + s/4}. The ratio λ/n\lambda/n becomes es/4e^{-s/4} and P(w(Y)>λ)=r(s)\mathbf{P}(w(Y)>\lambda)=r(s). Overall, the RHS of becomes equal to φ2×ε(s)2 |\varphi|_2 \times \varepsilon(s)^2 where

ε(s)=es/4+r(s). \varepsilon(s) = \sqrt{e^{-s/4} + r(s)}.

We will prove this lemma later. From now on, let us prove the theorem.

Proof of the positive part. Apply Markov's inequality and the Lemma to the constant function 11 and to the function φ\varphi. We get P(In(1)1>δ)ε(s)2/δandP(In(φ)I(φ)>δ)φ2ε(s)2/δ.\begin{aligned} &\mathbf{P}(|I_n(1) - 1|>\delta) \leqslant \varepsilon(s)^2/\delta &&\text{and}&&\mathbf{P}(|I_n(\varphi) - I(\varphi)|>\delta') \leqslant |\varphi|_2 \varepsilon(s)^2/\delta'.\end{aligned} If In(1)1<δ|I_n(1)-1|<\delta and In(φ)I(φ)<δ|I_n(\varphi) - I(\varphi)|<\delta' then we see that Jn(φ)I(φ)=In(φ)/In(1)I(φ)In(φ)I(φ)+I(φ)In(1)1In(1)δ+δI(φ)1δ\begin{aligned}|J_n(\varphi) - I(\varphi)| = |I_n(\varphi)/I_n(1) - I(\varphi)|&\leqslant \frac{|I_n(\varphi) - I(\varphi)| + |I(\varphi)||I_n(1)-1|}{|I_n(1)|} \\ &\leqslant \frac{\delta' + \delta I(\varphi)}{1-\delta} \end{aligned} Now choose δ=ε(s)\delta = \varepsilon(s) and δ=φ2ε(s)\delta' = |\varphi|_2 \varepsilon(s). The probability of having at least one of the two events in (15) is smaller than 2ε(s)2\varepsilon(s) by the union bound. Moreover, since I(φ)φ2|I(\varphi)|\leqslant |\varphi|_2 by the Cauchy-Schwarz inequality, we get that

δ+δI(φ)1δ2ε(s)φ21ε(s)\frac{\delta' + \delta |I(\varphi)|}{1-\delta}\leqslant \frac{2\varepsilon(s)|\varphi|_2}{1 - \varepsilon(s)}

and the result holds.

Proof of the negative part. Use the Lemma with λ=eDs/2\lambda = e^{D - s/2} and set φ(x)=1w(x)λ\varphi(x) = \mathbf{1}_{w(x)\leqslant \lambda}, so that I(φ)=P(w(Y)λ)r(s/2)I(\varphi) = \mathbf{P}(w(Y)\leqslant \lambda) \leqslant r(s/2). By the Markov inequality, P(w(Y)>λ)E[w(Y)]/λ=1/λ\mathbf{P}(w(Y)>\lambda) \leqslant \mathbf{E}[w(Y)]/\lambda = 1/\lambda. If every XiX_i has w(Xi)λw(X_i)\leqslant \lambda, then clearly Jn(φ)=1J_n(\varphi)=1. By the union bound we thus have

P(Jn(φ)1)i=1nP(w(Xi)>λ)nλ=es/2. \mathbf{P}(J_n(\varphi)\neq 1) \leqslant \sum_{i=1}^n \mathbf{P}(w(X_i)>\lambda) \leqslant \frac{n}{\lambda}=e^{-s/2}.

Proof of the Lemma

Set ψ=φ1w(x)λ\psi = \varphi \mathbf{1}_{w(x)\leqslant \lambda}. Then,

In(φ)I(φ)In(φ)In(ψ)+In(ψ)I(ψ)+I(ψ)I(φ)=A+B+C |I_n(\varphi) - I(\varphi) |\leqslant | I_n(\varphi) - I_n(\psi) |+| I_n(\psi) - I(\psi)| + |I(\psi) - I(\varphi)| = A+B+C

– Term CC is equal to ρ(x)φ(x)1w(x)>λdx\int \rho(x)\varphi(x)\mathbb{1}_{w(x)> \lambda}dx. By the Cauchy-Schwarz inequality it is smaller than

φ(x)2ρ(x)dx×P(w(Y)>λ)=φ2P(w(Y)>λ). \sqrt{\int \varphi(x)^2 \rho(x)dx \times \mathbb{P}(w(Y)> \lambda)} = |\varphi|_{2}\sqrt{\mathbb{P}(w(Y)> \lambda)}.

– Term BB is smaller than 1nZw(Xi)φ(Xi)1w(Xi)>λ \frac{1}{nZ}\sum w(X_i)|\varphi(X_i)| \mathbf{1}_{w(X_i)>\lambda} hence its expectation is smaller than E[w(X)φ(X)1w(X)>λ]/Z=E[φ(Y)1w(Y)>λ]\mathbb{E}[w(X)|\varphi(X)|\mathbf{1}_{w(X)>\lambda}]/Z = \mathbb{E}[\varphi(Y)\mathbf{1}_{w(Y)>\lambda}] and we use the same trick as (20).

– Term AA is bounded as follows:

E[In(ψ)I(ψ)]E[In(ψ)I(ψ)2]=Var(In(ψ))ψ(x)2w(x)2g(x)dxnZw(x)λφ(x)2w(x)f(x)dxnZλφ(x)2ρ(x)dxn=λφ22/n.\begin{aligned} \mathbf{E}[|I_n(\psi) - I(\psi)|]&\leqslant \sqrt{\mathbf{E}[|I_n(\psi) - I(\psi)|^2]} = \mathrm{Var}(I_n(\psi))\\ &\leqslant\sqrt{\frac{\int \psi(x)^2w(x)^2g(x)dx}{nZ} }\\ &\leqslant \sqrt{ \frac{\int_{w(x)\leqslant \lambda} \varphi(x)^2 w(x) f(x) dx}{n Z}}\\ &\leqslant \sqrt{\lambda \frac{\int \varphi(x)^2 \rho(x) dx}{n} } = \sqrt{\lambda |\varphi|^2_{2}/n}.\\ \end{aligned}

We gather the three bounds and get (13).