*Sampling* refers to the generation of random variables following a certain probability distribution; for example if $\rho$ is a probability measure on $\mathbb{R}$, we want to generate random variables which are independent and follow the distribution given by $\rho$, or we want to compute expectations like $\mathbb{E}[\varphi(Y)]$ for some function $\varphi$, where $Y \sim \rho$. In many cases, one does not fully knows $\rho$, but only that it is proportional to some function $f$, that is $\rho(x) = f(x)/Z$ where $Z = \int f(u)du$, and computing $Z$ (the normalizing constant) is intractable.

There are many techniques that still allow to sample from $\rho$ in this case; in this note I'm focusing on *importance sampling* (IS), also called *reweighting*. In this note I prove an insightful result by Chatterjee and Diaconis on the number of samples required for IS to be sufficiently precise.

Let $\nu$ be probability measure, easy to sample from, with density $g$. Let $Y \sim \rho$ and $X\sim \nu$ (in the sequel, $Y_i$ will always have density $\rho$ and $X_i$ density $g$). Then, for any function $\varphi$,

$\mathbb{E}[\varphi(X)f(X)/g(X)] = \int \varphi(x)\frac{f(x)}{g(x)}g(x)dx = \int \varphi(x)f(x)dx = Z \mathbb{E}[\varphi(Y)].$Applying this formula to $\varphi \equiv 1$ also yields $\mathbb{E}[f(X)/g(X)] =Z$. Consequently,

$\frac{ \mathbb{E}[\varphi(X)f(X)/g(X)] }{ \mathbb{E}[f(X)/g(X)] } = \mathbb{E}[\varphi(Y)].$This suggests the following method for approximating $\mathbb{E}[\varphi(Y)]$ without computing $Z$.

Let $X_1, \dotsc, X_n$ be iid with law $\nu$. Set $w(x) = f(x)/g(x)$. Then, when $n\to\infty$, almost surely one has $\frac{\sum_{i=1}^n w(X_i)}{n} \to Z$ and $\frac{\sum_{i=1}^n w(X_i) \varphi(X_i)}{\sum_{i=1}^n w(X_i)} \to \mathbb{E}[\varphi(Y)]= \int \Phi(x)\frac{f(x)}{Z}dx.$

**Proof.** By the Law of Large Numbers, the LHS of (3) converges towards $\mathbf{E}[w(X)] = \int w(x)g(x)dx = \int f(x)dx = Z$.

Also by the LLN, $\frac{\sum w(X_i)\varphi(X_i)}{n}$ converges towards $\mathbf{E}[\varphi(X)w(X)] = Z\mathbf{E}[\varphi(Y)]$.

Take the ratio of the two to get (4).

This technique explains the word *importance sampling*: the samples $X_i$ are iid but their density is $g(x)$, not $f(x)/Z$, and the weights correct the difference between the two. If some $x$ is very likely under $g$ but not so much under $f$ (meaning that $f(x)/Z$ is close to zero but $g(x)$ is high), then this sample will be assigned a very small weight.

Unless $\rho$ and $\nu$ are already very close to each other, this can be pretty bad. Let's look at the variance of the estimator $\hat{Z}$ given in (3): clearly $\mathbf{E}[\hat{Z}]=Z$, hence

$\mathrm{Var}(\hat{Z}) = \frac{\int g(x) \frac{f(x)^2}{g(x)^2}dx - Z^2}{n} = \frac{\int f(x) \frac{f(x)}{g(x)}dx - Z^2}{n} = \frac{Z\mathbf{E}[f(Y)/g(Y)]-Z^2}{n}.$The first term could be prohibitively big. Indeed, let us consider the following situation: our prior density is a Standard Gaussian $g(x) = e^{-x^2/2}/\sqrt{2\pi}$, and our target density is the same, but shifted by 10: $f(x) = e^{-(x-10)^2/2}/\sqrt{2\pi}$. Here both densities are normalized, so $Z=1$. But,

$\mathbf{E}[f(Y)/g(Y)] = \frac{1}{\sqrt{2\pi}}\int e^{-\frac{(x-10)^2}{2}}e^{-\frac{(x-10)^2 }{ 2} + \frac{x^2}{2}}dx = \frac{1}{\sqrt{2\pi}}\int e^{- \frac{(x-20)^2}{2} + 100}dx = e^{100}.$That is too big. Having a low variance for $\hat{Z}$ would require having more than $e^{100}$ samples, which is impossible.

Practically, a good indicator of the quality of our importance sampling is given by the *effective sample size*. Suppose you used $n$ samples $X_i$. If they were distributed exactly as $\rho$, that is if $g$ was proportional to $f$, we would have $w(X_i) = 1/Z$ for all $X_i$. In general $f$ is not proportional to $g$ hence that's not the case and one way to measure how far the samples we have are from $n$ samples of $\rho$, we set $\hat{\sigma}_n$ = the empirical coefficient of variation of the $w(X_i)$, and

Suppose that you use IS with a sample size of $n=1000$. If $\mathrm{ESS} \approx 1000$ then the weigths are almost constant and there's a good chance that our sampling is excellent. If it's small, say $\mathrm{ESS} \approx 100$, then it means that the quality of your IS estimator is the same as if you would have used 100 samples of the real distribution $\rho$.

This method is a good *rule of thumb* to assess the quality of IS, but it's quite empirical. In general, what is the required number of samples when one wants to efficiently estimate a mean $\mathbf{E}[\varphi(Y)]$ using importance sampling?

Let

$D = d_{\mathrm{KL}}(f\mid g) = \int \log (f(x)/g(x)) f(x)dx$be the Kullback-Leibler divergence from $\rho$ to $\nu$ (it is supposed to be $<\infty$). It can also be defined as $D=\mathbf{E}[\log w(Y)]$. In general, concentration inequalities allow to bound how much $\log w(Y)$ is concentrated around its mean $D$: depending on $\nu,\mu$, the deviation probability

$r(s)=\mathbf{P}(|\log w(Y) - D| > s)$should be a decreasing function of $s$ with $r(s)\to 0$. The error terms in the sequel will be expressed using this function $r$. Roughly speaking, $r(s)$ is very small. Note that I chose the two-sided deviation probability, mostly for simplicity.

For any function $\varphi$ we'll note $J_n(\varphi)$ the estimator on the RHS of (4), that is

$J_n(\varphi) = \frac{\sum_{i=1}^n w(X_i) \varphi(X_i)}{\sum_{i=1}^n w(X_i)}.$We also note $I(\varphi) = \int \varphi(x)\rho(x)dx = \mathbf{E}[\varphi(Y)]$ the integral and $|\varphi|^2_2 = \int \varphi(x)^2 \rho(x)dx$ the L2-norm.

**Theorem (Chatterjee and Diaconis, 2015).**

Let $s$ be any positive integer and $\varepsilon(s) = (e^{-s/4} + 2\sqrt{r(s/2)})^{1/2}$.

*Positive part*. Suppose that $n=e^{D+s}$. Then, for any $\varphi$, $\mathbf{P}(|J_n(\varphi) - \mathbf{E}[\varphi(Y)]|>\varepsilon(s) |\varphi|_2)\leqslant 2\varepsilon(s).$ *Negative part*. Conversely if $n=e^{D-s}$, there is a function $\varphi$ such that $\mathbf{E}[\varphi(Y)]\leqslant r(s/2)$ but

The proof almost entirely relies on the following lemma, in which we have set $I_n(\varphi)= \frac{\sum_{i=1}^n \varphi(X_i)w(X_i)}{nZ}$.

As noted earlier we have $\mathbf{E}[\log w(Y)] = \int f(x)\log(f(x)/g(x))dx = D$, hence it is natural to choose $\log \lambda$ at the same scale as $D$. We take $\lambda = e^{D + s/4}$. The ratio $\lambda/n$ becomes $e^{-s/4}$ and $\mathbf{P}(w(Y)>\lambda)=r(s)$. Overall, the RHS of becomes equal to $|\varphi|_2 \times \varepsilon(s)^2$ where

$\varepsilon(s) = \sqrt{e^{-s/4} + r(s)}.$We will prove this lemma later. From now on, let us prove the theorem.

**Proof of the positive part.** Apply Markov's inequality and the Lemma to the constant function $1$ and to the function $\varphi$. We get $\begin{aligned}
&\mathbf{P}(|I_n(1) - 1|>\delta) \leqslant \varepsilon(s)^2/\delta &&\text{and}&&\mathbf{P}(|I_n(\varphi) - I(\varphi)|>\delta') \leqslant |\varphi|_2 \varepsilon(s)^2/\delta'.\end{aligned}$ If $|I_n(1)-1|<\delta$ and $|I_n(\varphi) - I(\varphi)|<\delta'$ then we see that $\begin{aligned}|J_n(\varphi) - I(\varphi)| = |I_n(\varphi)/I_n(1) - I(\varphi)|&\leqslant \frac{|I_n(\varphi) - I(\varphi)| + |I(\varphi)||I_n(1)-1|}{|I_n(1)|} \\
&\leqslant \frac{\delta' + \delta I(\varphi)}{1-\delta} \end{aligned}$ Now choose $\delta = \varepsilon(s)$ and $\delta' = |\varphi|_2 \varepsilon(s)$. The probability of having at least one of the two events in (15) is smaller than $2\varepsilon(s)$ by the union bound. Moreover, since $|I(\varphi)|\leqslant |\varphi|_2$ by the Cauchy-Schwarz inequality, we get that

and the result holds.

**Proof of the negative part.** Use the Lemma with $\lambda = e^{D - s/2}$ and set $\varphi(x) = \mathbf{1}_{w(x)\leqslant \lambda}$, so that $I(\varphi) = \mathbf{P}(w(Y)\leqslant \lambda) \leqslant r(s/2)$. By the Markov inequality, $\mathbf{P}(w(Y)>\lambda) \leqslant \mathbf{E}[w(Y)]/\lambda = 1/\lambda$. If every $X_i$ has $w(X_i)\leqslant \lambda$, then clearly $J_n(\varphi)=1$. By the union bound we thus have

Set $\psi = \varphi \mathbf{1}_{w(x)\leqslant \lambda}$. Then,

$|I_n(\varphi) - I(\varphi) |\leqslant | I_n(\varphi) - I_n(\psi) |+| I_n(\psi) - I(\psi)| + |I(\psi) - I(\varphi)| = A+B+C$**– Term $C$** is equal to $\int \rho(x)\varphi(x)\mathbb{1}_{w(x)> \lambda}dx$. By the Cauchy-Schwarz inequality it is smaller than

**– Term $B$** is smaller than $\frac{1}{nZ}\sum w(X_i)|\varphi(X_i)| \mathbf{1}_{w(X_i)>\lambda}$ hence its expectation is smaller than $\mathbb{E}[w(X)|\varphi(X)|\mathbf{1}_{w(X)>\lambda}]/Z = \mathbb{E}[\varphi(Y)\mathbf{1}_{w(Y)>\lambda}]$ and we use the same trick as (20).

**– Term $A$** is bounded as follows:

We gather the three bounds and get (13).

The paper by Chatterjee and Diaconis, with a very nice application to an old remark by Knuth on self-avoiding walks.

A nice paper on the Effective Sample Size by Jun S Liu, with a theoretical justification on why the ESS is significant and a comparison with other sampling techniques.