Importance sampling ⚖️⚖️ : The Jarzynski connection

March 2022

Let VtV_t be a family of smooth potentials on Rn\mathbb{R}^n, continuously evolving with time t[0,1]t\in[0,1]. Let ptp_t be their associated Gibbs measure: 

pt(x)=eVt(x)ZtwhereZt=RneVt(x)dx.\begin{aligned} &p_t(x) = \frac{e^{-V_t(x)}}{Z_t}&& \text{where}&& Z_t = \int_{\mathbb{R}^n} e^{-V_t(x)}dx.\end{aligned}

It is not easy to find a continuous dynamic XtX_t, such that at each time tt the distribution of XtX_t matches ptp_t. One might be tempted to use the overderdamped Langevin dynamics,

dXt=Vt(Xt)dt+2dWtdX_t = - \nabla V_t(X_t)dt + \sqrt{2}dW_t

with (Wt)(W_t) a Brownian motion. But the density ρt\rho_t of XtX_t is not equal to ptp_t in general. Indeed, ρt\rho_t solves the Fokker-Planck equation

ρ˙t=Δρt[ρtVt].\dot{\rho}_t = \Delta\rho_t - \nabla \cdot [\rho_t \nabla V_t].

If ρt\rho_t was equal to ptp_t, then ptp_t should also satisfy this equation; but for ptp_t one easily checks that the RHS is zero, resulting in p˙t=0\dot{p}_t=0 which is false in general if we choose a family of evolving potentials; it would only be true at equilibrium when VtV_t is constant.

The Jarczynski trick allows to reconnect the distribution of XtX_t with ptp_t. It introduces a weight, the solution of an auxiliary ODE, such that its conditional expectation given Xt=xX_t=x gives pt(x)p_t(x), hence correcting the mismatch between ρt\rho_t and ptp_t.

(In the sequel we note Ft=logZtF_t = \log Z_t, and the dot in V˙t\dot{V}_t or F˙t\dot{F}_t always represents a derivative with respect to time. )

Let (Xt,Wt)(X_t, W_t) be the solution of the following augmented system: dXt=Vt(Xt)dt+2dWtX0p0dWt=(V˙t(Xt)Ft˙)WtdtW0=1.\begin{aligned} &dX_t = - \nabla V_t(X_t)dt + \sqrt{2}dW_t &&& X_0 \sim p_0 \\&dW_t = (-\dot{V}_t(X_t) - \dot{F_t})W_t dt &&& W_0=1. \end{aligned} Then, for any test function φ\varphi, E[φ(Xt)Wt]=φ(x)eVt(x)Ztdx.\mathbf{E}[\varphi(X_t)W_t] = \int \varphi(x)\frac{e^{-V_t(x)}}{Z_t}dx. In other words, the probability measure P(A)=E[1AWt]P(A) = \mathbf{E}[\mathbf{1}_A W_t] is exactly ptp_t.

Jarczynski's goal was to compute, at least numerically, the partition functions ZtZ_t or equivalently the free energy Ft=logZtF_t = - \log Z_t. The Jarczynski estimator, discovered in 1996, simply consists in applying (5) to φ=1\varphi=1, and noting that WtW_t is a simple ODE solved by

Wt=W0e0tV˙s(Xs)F˙sds=ZtZ0e0tV˙s(Xs)ds.W_t = W_0e^{-\int_0^t \dot{V}_s(X_s) - \dot{F}_s ds} = \frac{Z_t}{Z_0}e^{-\int_0^t \dot{V}_s(X_s)ds}.

Jarzynski's identity states that

E[e0tV˙s(Xs)F˙sds]=1.\mathbf{E}[e^{-\int_0^t \dot{V}_s(X_s) - \dot{F}_sds}] = 1.

The consequence of this identity is that

logE[e0tV˙s(Xs)ds]=FtF0, \log \mathbf{E}[e^{-\int_0^t \dot{V}_s(X_s)ds}] = F_t - F_0,

giving rise to a natural estimator for the free energy difference FtF0F_t-F_0: one samples many paths from the dynamics (2), then computes the path integral wt=0tV˙s(Xs)dsw_t = \int_0^t \dot{V}_s(X_s)ds along each path, and then average the values of ewte^{-w_t} and take the log.

In general, exponential of paths integrals are not easy to study. The classical way to prove Jarczynski's identity directly is to use the Feynman-Kac representation of PDEs, but this proof seems quite involved; instead, the trick above with the augmented system seems to be easier. It was shown to me by Eric Vanden-Eijnden.


Equation (5) can be reformulated as follows: if we note ϱt(x,w)\varrho_t(x,w) the density of (Xt,Wt)(X_t, W_t), then

pt(x)=0wϱt(x,w)dw. p_t(x) = \int_0^\infty w \varrho_t(x,w)dw.

Note that the integral is only on the positive axis because WtW_t is always positive from (6). We note qt(x)q_t(x) the term on the right; we are going to prove pt=qtp_t=q_t by proving that both satisfy the same parabolic equation, then invoke a uniqueness result.

From now on, we'll set Ut=VtFtU_t = V_t - F_t, so that pt(x)=eUt(x)p_t(x) = e^{-U_t(x)} is normalized. Clearly, Ut=Vt\nabla U_t = \nabla V_t.

The augmented system

We consider the system Zt=(Xt,Wt)Z_t= (X_t, W_t) in (4) as a single SDE

dZt=ft(Zt)dt+σdWtd Z_t = f_t(Z_t)dt + \sigma dW_t


ft(x,w)=(Ut(x)U˙t(x)w) and σ=diag(2,,2,0).\begin{aligned}&f_t(x,w) = \begin{pmatrix} - \nabla U_t(x) \\ -\dot{U}_t(x)w \end{pmatrix} &&\text{ and }&& \sigma = \mathrm{diag}(\sqrt{2}, \dotsc, \sqrt{2}, 0).\end{aligned}

Let us note ϱt(x,y)\varrho_t(x,y) the density of (Xt,Wt)(X_t, W_t), and write the associated Fokker-Planck equation; here and after, \nabla is a gradient and \nabla \cdot is the divergence. For clarity in this section, subscripts denote partial differentiation with respect to the corresponding variables.

ϱ˙t=x,w[ftϱt]+Δxϱt=(x,wft)ϱtftx,wϱt+Δxϱt\begin{aligned}\dot{\varrho}_t &= -\nabla_{x,w} \cdot \left[ f_t\varrho_t \right] + \Delta_x \varrho_t\\ &= - (\nabla_{x,w} \cdot f_t) \varrho_t - f_t \cdot \nabla_{x,w} \varrho_t + \Delta_x \varrho_t \\ \end{aligned}

Now by the definition of ftf_t, we have x,wft(x,w)=xxUt(x)U˙t(x)\nabla_{x,w}f_t(x,w) = - \nabla_x \nabla_x U_t(x) - \dot{U}_t(x), hence finally

ϱ˙t(x,w)=(ΔxUt(x))ϱt(x,w)+U˙t(x)ϱt(x,w)ft(x,w)x,wϱt(x,w)+Δxϱt(x,w).\dot \varrho_t(x,w)=- (\Delta_x U_t(x)) \varrho_t(x,w) + \dot{U}_t(x)\varrho_t(x,w) - f_t(x,w) \cdot \nabla_{x,w} \varrho_t (x,w) + \Delta_x \varrho_t(x,w).

The marginal integration

We already set qt(x)=0wϱt(x,w)dwq_t(x) = \int_0^\infty w \varrho_t(x,w)dw, so that E[Wt]=qt(x)dx\mathbf{E}[W_t] = \int q_t(x)dx for example. Multiplying the last equation by ww, integrating, and swapping integrals and derivatives, we get

q˙t=wϱ˙t(x,w)dw=(ΔxUt)qt+U˙tqtw(ftx,wϱt)+Δxqt\begin{aligned}\dot{q}_t &= \int w \dot{\varrho}_t(x,w)dw \\ &= - (\Delta_x U_t) q_t + \dot{U}_t q_t - \int w (f_t \cdot \nabla_{x,w} \varrho_t) + \Delta_x q_t \end{aligned}

We have ftx,wϱt=xUtxϱtU˙twwϱtf_t \cdot \nabla_{x,w} \varrho_t = - \nabla_x U_t \cdot \nabla_x \varrho_t - \dot{U}_t w \nabla_w \varrho_t, hence

q˙t=(ΔUt)qt+U˙tqt+xUtxqt+U˙tw2wϱt(x,w)dw+Δxqt.\begin{aligned}\dot{q}_t &= - (\Delta U_t) q_t + \dot{U}_t q_t + \nabla_x U_t \cdot \nabla_x q_t + \dot{U}_t \int w^2 \nabla_w \varrho_t(x,w)dw+ \Delta_x q_t. \end{aligned}

An integration by parts shows that 0w2wϱt(x,w)dw=20wϱt(x,w)dw=2qt(x)\int_0^\infty w^2 \nabla_w \varrho_t(x,w)dw = -2 \int_0^\infty w \varrho_t(x,w)dw = -2q_t(x), provided a sufficient decay of ww2ϱ(x,w)w \to w^2 \varrho(x,w) at infinity. We obtain the following equation:

q˙t=(ΔxUt)qtU˙tqt+xUtxqt+Δxqt=Δqt[qtUt]U˙tqt\begin{aligned} \dot{q}_t &= -(\Delta_x U_t)q_t - \dot{U}_t q_t + \nabla_x U_t \cdot \nabla_x q_t + \Delta_x q_t \\ &= \Delta q_t - \nabla \cdot [q_t \nabla U_t ] - \dot{U}_t q_t \end{aligned}

where I removed the subscript for clarity. This equation is a Fokker-Planck equation (first terms) but with a birth-death term added at the end.

It turns out that the density pt(x)=eUt(x)p_t(x) = e^{-U_t(x)} is a solution of this equation. Indeed, p˙t=U˙tpt\dot{p}_t = - \dot{U}_t p_t and pt=Utpt\nabla p_t = -\nabla U_t p_t, so that it is easy to check that (16) is satisfied.

Uniqueness of solutions to parabolic PDEs

We're now left with checking that the solutions of

u˙t=Δut[utBt]+Ctut\dot{u}_t = \Delta u_t - \nabla \cdot [u_t B_t] + C_tu_t

are essentially unique if they start with the same initial conditions, which is the case here. In the equation Bt=UtB_t = \nabla U_t and Ct=U˙t C_t=-\dot{U}_t. Uniqueness for solutions of parabolic equations is well-known to PDE theorists and can be found in Evans' book, chapter 7. We provide a proof for completeness. Indeed, by linearity, it is sufficient to check that the only solution of (17) started at u0(x)=0u_0(x) = 0 is identically zero. Set γ(t):=ut(x)2dx\gamma(t):=\int |u_t(x)|^2dx; we're going to bound its derivative using "energy estimates". We have

γ˙(t)=utΔutut[utBt]+ut2Ct.\begin{aligned}\dot{\gamma}(t) &= \int u_t \Delta u_t - u_t\nabla\cdot [u_t B_t] + |u_t|^2 C_t. \end{aligned}

The first term is ut2-\int |\nabla u_t|^2. The second term is equal to +ututBt+\int \nabla u_t \cdot u_t B_t by integration by parts. Young's inequality abλa2+b2/4λa \cdot b \leqslant \lambda |a|^2 + |b|^2 / 4\lambda, valid for any λ>0\lambda>0, yields

ututFtλut2+14λut2Bt2.\int \nabla u_t \cdot u_tF_t \leqslant \lambda \int |\nabla u_t|^2 + \frac{1}{4\lambda}\int |u_t|^2 |B_t|^2.

In particular with λ=1\lambda=1 we cancel the first term and we obtain

γ˙(t)14λut2(Bt2+Ct).\dot{\gamma}(t) \leqslant \frac{1}{4\lambda}\int |u_t|^2 (|B_t|^2 + C_t).

Now, if Bt,CtB_t, C_t are bounded by some constant cc, we get γ˙(t)cγ(t)\dot{\gamma}(t) \leqslant c \gamma(t). Grönwall's lemma yields γ(t)γ(0)ect=0\gamma(t)\leqslant \gamma(0)e^{ct}=0 and since γ(t)\gamma(t) is always positive, we get γ(t)=0\gamma(t)=0 and ut=0u_t=0 for all tt.