# Importance sampling ⚖️

June 2023

Sampling refers to the generation of random variables following a given probability distribution: for example, if $\rho$ is a probability measure on $\mathbb{R}$, we want to generate independent random variables distributed according to $\rho$, or to compute expectations like $\mathbb{E}[\varphi(Y)]$ for some function $\varphi$, where $Y \sim \rho$. In many cases, one does not fully know $\rho$, but only that it is proportional to some function $f$, that is $\rho(x) = f(x)/Z$ where $Z = \int f(u)du$, and computing the normalizing constant $Z$ is intractable.

There are many techniques that still allow sampling from $\rho$ in this case; here I focus on importance sampling (IS), also called reweighting, and I prove an insightful result by Chatterjee and Diaconis on the number of samples required for IS to be sufficiently precise.

## The basic idea of Importance sampling

Let $\nu$ be a probability measure, easy to sample from, with density $g$. Let $Y \sim \rho$ and $X\sim \nu$ (in the sequel, $Y_i$ will always have distribution $\rho$ and $X_i$ density $g$). Then, for any function $\varphi$,

$\mathbb{E}[\varphi(X)f(X)/g(X)] = \int \varphi(x)\frac{f(x)}{g(x)}g(x)dx = \int \varphi(x)f(x)dx = Z \mathbb{E}[\varphi(Y)].$

Applying this formula to $\varphi \equiv 1$ also yields $\mathbb{E}[f(X)/g(X)] =Z$. Consequently,

$\frac{ \mathbb{E}[\varphi(X)f(X)/g(X)] }{ \mathbb{E}[f(X)/g(X)] } = \mathbb{E}[\varphi(Y)].$

This suggests the following method for approximating $\mathbb{E}[\varphi(Y)]$ without computing $Z$.

Let $X_1, \dotsc, X_n$ be iid with law $\nu$. Set $w(x) = f(x)/g(x)$. Then, as $n\to\infty$, almost surely one has $\frac{\sum_{i=1}^n w(X_i)}{n} \to Z$ and $\frac{\sum_{i=1}^n w(X_i) \varphi(X_i)}{\sum_{i=1}^n w(X_i)} \to \mathbb{E}[\varphi(Y)]= \int \varphi(x)\frac{f(x)}{Z}dx.$

Proof. By the Law of Large Numbers, the first average $\frac{1}{n}\sum_{i=1}^n w(X_i)$ converges towards $\mathbf{E}[w(X)] = \int w(x)g(x)dx = \int f(x)dx = Z$.

Also by the LLN, $\frac{\sum w(X_i)\varphi(X_i)}{n}$ converges towards $\mathbf{E}[\varphi(X)w(X)] = Z\mathbf{E}[\varphi(Y)]$.

Take the ratio of the two to get the second limit.

This explains the name importance sampling: the samples $X_i$ are iid, but their density is $g(x)$ rather than $f(x)/Z$, and the weights correct for the difference between the two. If some $x$ is very likely under $g$ but not under $f$ (that is, $g(x)$ is high but $f(x)/Z$ is close to zero), then this sample is assigned a very small weight.
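As a concrete illustration, here is a minimal NumPy sketch of these two estimators. The unnormalized target $f$ (a Gaussian bump centered at 1, so that $Z=\sqrt{2\pi}$ and $\mathbb{E}[Y]=1$) and the $\mathcal{N}(0,2^2)$ proposal are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: f(x) = exp(-(x-1)^2 / 2), so rho = N(1, 1) and Z = sqrt(2*pi).
def f(x):
    return np.exp(-0.5 * (x - 1.0) ** 2)

# Proposal nu = N(0, 2^2): easy to sample from, with a known density g.
def g(x):
    return np.exp(-x ** 2 / 8.0) / (2.0 * np.sqrt(2.0 * np.pi))

n = 100_000
X = rng.normal(loc=0.0, scale=2.0, size=n)  # X_i ~ nu
w = f(X) / g(X)                             # importance weights w(X_i) = f(X_i) / g(X_i)

Z_hat = w.mean()                            # estimates Z = sqrt(2*pi) ~ 2.507
mean_hat = np.sum(w * X) / np.sum(w)        # self-normalized estimate of E[Y] = 1
print(Z_hat, mean_hat)
```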

## How good is this approximation?

Unless $\rho$ and $\nu$ are already very close to each other, this approximation can be pretty bad. Let's look at the variance of the estimator $\hat{Z} = \frac{1}{n}\sum_{i=1}^n w(X_i)$: clearly $\mathbf{E}[\hat{Z}]=Z$, hence

$\mathrm{Var}(\hat{Z}) = \frac{\int g(x) \frac{f(x)^2}{g(x)^2}dx - Z^2}{n} = \frac{\int f(x) \frac{f(x)}{g(x)}dx - Z^2}{n} = \frac{Z\mathbf{E}[f(Y)/g(Y)]-Z^2}{n}.$

The first term can be prohibitively big. Indeed, consider the following situation: the proposal density is a standard Gaussian $g(x) = e^{-x^2/2}/\sqrt{2\pi}$, and the target density is the same, but shifted by 10: $f(x) = e^{-(x-10)^2/2}/\sqrt{2\pi}$. Here both densities are normalized, so $Z=1$. But,

$\mathbf{E}[f(Y)/g(Y)] = \frac{1}{\sqrt{2\pi}}\int e^{-\frac{(x-10)^2}{2}}e^{-\frac{(x-10)^2 }{ 2} + \frac{x^2}{2}}dx = \frac{1}{\sqrt{2\pi}}\int e^{- \frac{(x-20)^2}{2} + 100}dx = e^{100}.$

That is far too big: getting a low variance for $\hat{Z}$ would require on the order of $e^{100}$ samples, which is impossible.
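To see this numerically, here is a small sketch in the same style as above (the sample size of one million is an arbitrary choice): almost all of the weight ends up on a handful of points, and $\hat{Z}$ is nowhere near its true value $1$.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 1_000_000
X = rng.standard_normal(n)                     # samples from g = N(0, 1)
log_w = -0.5 * (X - 10.0) ** 2 + 0.5 * X ** 2  # log w(X) = log f(X) - log g(X) = 10*X - 50

w = np.exp(log_w)
print(w.max() / w.sum())   # fraction of the total weight on the largest sample: typically most of it
print(w.mean())            # \hat{Z}: should be 1, but comes out many orders of magnitude too small
```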

### The Effective Sample Size

In practice, a good indicator of the quality of our importance sampling is given by the effective sample size. Suppose you used $n$ samples $X_i$. If $g$ were proportional to $f$, that is if the $X_i$ were distributed exactly as $\rho$, we would have $w(X_i) = Z$ for every $i$: the weights would be constant. In general $f$ is not proportional to $g$, so the weights fluctuate, and one way to measure how far our $n$ samples are from $n$ samples of $\rho$ is to compute $\hat{\sigma}_n$, the empirical coefficient of variation of the $w(X_i)$, and set

$\mathrm{ESS}(n) = \frac{n}{1 + \hat{\sigma}_n^2}.$
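A minimal implementation of this diagnostic could look as follows (the function name is my own choice); applied to the weights of the shifted-Gaussian example above, it returns a value of order $1$ even with a million samples.

```python
import numpy as np

def effective_sample_size(w: np.ndarray) -> float:
    """Kong's effective sample size n / (1 + cv^2), where cv is the empirical
    coefficient of variation of the weights.  Equivalently, (sum w)^2 / sum(w^2)."""
    cv2 = np.var(w) / np.mean(w) ** 2
    return len(w) / (1.0 + cv2)
```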

### How to use the ESS?

Suppose that you use IS with a sample size of $n=1000$. If $\mathrm{ESS} \approx 1000$ then the weights are almost constant and there's a good chance that the sampling is excellent. If it's small, say $\mathrm{ESS} \approx 100$, then the quality of your IS estimator is roughly the same as if you had used 100 genuine samples of the target distribution $\rho$.

This method is a good rule of thumb to assess the quality of IS, but it's quite empirical. In general, what is the required number of samples when one wants to efficiently estimate a mean $\mathbf{E}[\varphi(Y)]$ using importance sampling?

## The number of samples required for IS

Let

$D = d_{\mathrm{KL}}(\rho \mid \nu) = \int \log (f(x)/g(x))\, f(x)\,dx$

be the Kullback-Leibler divergence from $\rho$ to $\nu$ (assumed to be finite). Here and in the rest of this section I assume, as in Chatterjee and Diaconis's paper, that $f$ is already normalized, so that $Z=1$ and $f=\rho$; this is not a real restriction, since the self-normalized estimator studied below is unchanged when $f$ is rescaled. With this convention, $D$ is simply $\mathbf{E}[\log w(Y)]$. In general, concentration inequalities allow one to quantify how much $\log w(Y)$ concentrates around its mean $D$: depending on $\nu$ and $\rho$, the deviation probability

$r(s)=\mathbf{P}(|\log w(Y) - D| > s)$

should be a decreasing function of $s$ with $r(s)\to 0$ as $s\to\infty$. The error terms in the sequel will be expressed using this function $r$; roughly speaking, we think of $r(s)$ as very small once $s$ is large. Note that I chose the two-sided deviation probability, mostly for simplicity.
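For instance, in the shifted-Gaussian example above (target $\mathcal{N}(10,1)$, proposal $\mathcal{N}(0,1)$, so $Z=1$), a direct computation gives $\log w(y) = 10y - 50$, hence $\log w(Y) \sim \mathcal{N}(50, 100)$: the mean is $D = 50$, and the standard Gaussian tail bound gives

$r(s) = \mathbf{P}(|\log w(Y) - 50| > s) \leqslant 2e^{-s^2/200},$

which is small as soon as $s$ is a moderate multiple of $10$, the standard deviation of the log-weight.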

For any function $\varphi$ we write $J_n(\varphi)$ for the self-normalized estimator introduced above, that is

$J_n(\varphi) = \frac{\sum_{i=1}^n w(X_i) \varphi(X_i)}{\sum_{i=1}^n w(X_i)}.$

We also write $I(\varphi) = \int \varphi(x)\rho(x)dx = \mathbf{E}[\varphi(Y)]$ for the integral and $|\varphi|^2_2 = \int \varphi(x)^2 \rho(x)dx$ for the squared $L^2(\rho)$-norm.

Theorem (Chatterjee and Diaconis, 2015).

Let $s$ be any positive number and set $\varepsilon(s) = (e^{-s/4} + 2\sqrt{r(s/2)})^{1/2}$.

Positive part. Suppose that $n=e^{D+s}$. Then, for any $\varphi$,

$\mathbf{P}\left(|J_n(\varphi) - \mathbf{E}[\varphi(Y)]|>\frac{2\varepsilon(s)}{1-\varepsilon(s)} |\varphi|_2\right)\leqslant 2\varepsilon(s).$

Negative part. Conversely, if $n=e^{D-s}$, there is a function $\varphi$ such that $\mathbf{E}[\varphi(Y)]\leqslant r(s/2)$ but

$\mathbf{P}(J_n(\varphi)=1) \geqslant 1 - e^{-s/2}.$
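Before turning to the proof, here is a quick numerical sanity check of the positive part, in the same style as the sketches above; the shift $2$ (so that $D = 2$ and $e^D \approx 7.4$), the number of trials and the function name are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Target rho = N(2, 1) (normalized, so Z = 1 and D = KL(rho || nu) = 2^2/2 = 2),
# proposal nu = N(0, 1).  The true mean under rho is 2.
def self_normalized_mean(n, trials=200):
    """Self-normalized IS estimate of E[Y], averaged over repeated trials."""
    estimates = []
    for _ in range(trials):
        X = rng.standard_normal(n)                    # X_i ~ nu
        log_w = -0.5 * (X - 2.0) ** 2 + 0.5 * X ** 2  # log w(X_i)
        w = np.exp(log_w - log_w.max())               # J_n is invariant to rescaling the weights
        estimates.append(np.sum(w * X) / np.sum(w))
    return float(np.mean(estimates))

print(self_normalized_mean(n=3))      # n well below e^D: typically well below the true value 2
print(self_normalized_mean(n=5000))   # n well above e^D: close to 2
```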

The proof almost entirely relies on the following lemma, in which we have set $I_n(\varphi)= \frac{\sum_{i=1}^n \varphi(X_i)w(X_i)}{nZ}$.

Lemma. For any $\lambda >0$, $\mathbf{E}[|I_n(\varphi) - I(\varphi)|] \leqslant |\varphi|_{2}\left(\sqrt{\frac{\lambda}{n}} + 2\sqrt{\mathbf{P}(w(Y)>\lambda)}\right).$

As noted earlier we have $\mathbf{E}[\log w(Y)] = \int f(x)\log(f(x)/g(x))dx = D$, hence it is natural to choose $\log \lambda$ at the same scale as $D$. We take $\lambda = e^{D + s/2}$: with $n = e^{D+s}$, the ratio $\lambda/n$ becomes $e^{-s/2}$ and $\mathbf{P}(w(Y)>\lambda) = \mathbf{P}(\log w(Y) - D > s/2) \leqslant r(s/2)$. Overall, the RHS of the Lemma becomes smaller than $|\varphi|_2 \times \varepsilon(s)^2$ where

$\varepsilon(s) = \left(e^{-s/4} + 2\sqrt{r(s/2)}\right)^{1/2}$

is exactly the quantity appearing in the theorem. We will prove this lemma at the end; let us first prove the theorem.

Proof of the positive part. Apply Markov's inequality and the Lemma to the constant function $1$ and to the function $\varphi$. We get \begin{aligned} &\mathbf{P}(|I_n(1) - 1|>\delta) \leqslant \varepsilon(s)^2/\delta &&\text{and}&&\mathbf{P}(|I_n(\varphi) - I(\varphi)|>\delta') \leqslant |\varphi|_2 \varepsilon(s)^2/\delta'.\end{aligned} If $|I_n(1)-1|<\delta$ and $|I_n(\varphi) - I(\varphi)|<\delta'$, then we see that \begin{aligned}|J_n(\varphi) - I(\varphi)| = |I_n(\varphi)/I_n(1) - I(\varphi)|&\leqslant \frac{|I_n(\varphi) - I(\varphi)| + |I(\varphi)||I_n(1)-1|}{|I_n(1)|} \\ &\leqslant \frac{\delta' + \delta |I(\varphi)|}{1-\delta}. \end{aligned} Now choose $\delta = \varepsilon(s)$ and $\delta' = |\varphi|_2 \varepsilon(s)$. By the union bound, the probability that at least one of the two bounds above fails is smaller than $2\varepsilon(s)$. Moreover, since $|I(\varphi)|\leqslant |\varphi|_2$ by the Cauchy-Schwarz inequality, we get that

$\frac{\delta' + \delta |I(\varphi)|}{1-\delta}\leqslant \frac{2\varepsilon(s)|\varphi|_2}{1 - \varepsilon(s)}$

and the result holds.

Proof of the negative part. Take $\lambda = e^{D - s/2}$ and set $\varphi(x) = \mathbf{1}_{w(x)\leqslant \lambda}$, so that $I(\varphi) = \mathbf{P}(w(Y)\leqslant \lambda) \leqslant r(s/2)$. By Markov's inequality, $\mathbf{P}(w(X)>\lambda) \leqslant \mathbf{E}[w(X)]/\lambda = 1/\lambda$ (recall that $Z=1$). If every $X_i$ has $w(X_i)\leqslant \lambda$, then clearly $J_n(\varphi)=1$. By the union bound we thus have

$\mathbf{P}(J_n(\varphi)\neq 1) \leqslant \sum_{i=1}^n \mathbf{P}(w(X_i)>\lambda) \leqslant \frac{n}{\lambda}=e^{-s/2}.$

## Proof of the Lemma

Set $\psi(x) = \varphi(x) \mathbf{1}_{w(x)\leqslant \lambda}$. Then,

$|I_n(\varphi) - I(\varphi) |\leqslant | I_n(\varphi) - I_n(\psi) |+| I_n(\psi) - I(\psi)| + |I(\psi) - I(\varphi)| = A+B+C$

– Term $C$ is at most $\int \rho(x)|\varphi(x)|\mathbf{1}_{w(x)> \lambda}dx$. By the Cauchy-Schwarz inequality it is smaller than

$\sqrt{\int \varphi(x)^2 \rho(x)dx \times \mathbf{P}(w(Y)> \lambda)} = |\varphi|_{2}\sqrt{\mathbf{P}(w(Y)> \lambda)}.$

– Term $A$ is smaller than $\frac{1}{nZ}\sum w(X_i)|\varphi(X_i)| \mathbf{1}_{w(X_i)>\lambda}$, hence its expectation is smaller than $\mathbf{E}[w(X)|\varphi(X)|\mathbf{1}_{w(X)>\lambda}]/Z = \mathbf{E}[|\varphi(Y)|\mathbf{1}_{w(Y)>\lambda}]$, and we use the same Cauchy-Schwarz bound as for term $C$.

– Term $B$ is bounded as follows:

\begin{aligned} \mathbf{E}[|I_n(\psi) - I(\psi)|]&\leqslant \sqrt{\mathbf{E}[|I_n(\psi) - I(\psi)|^2]} = \sqrt{\mathrm{Var}(I_n(\psi))}\\ &\leqslant\sqrt{\frac{\int \psi(x)^2w(x)^2g(x)dx}{nZ^2} }\\ &= \sqrt{ \frac{\int_{w(x)\leqslant \lambda} \varphi(x)^2 w(x) f(x) dx}{n Z^2}}\\ &\leqslant \sqrt{\lambda \frac{\int \varphi(x)^2 \rho(x) dx}{n Z} } = \sqrt{\lambda |\varphi|^2_{2}/n},\\ \end{aligned}

where the last equality uses $Z=1$.

We gather the three bounds and the Lemma is proved.

## References

- The paper by Chatterjee and Diaconis, with a very nice application to an old remark by Knuth on self-avoiding walks.
- A nice paper on the Effective Sample Size by Jun S. Liu, with a theoretical justification of why the ESS is meaningful and a comparison with other sampling techniques.