Diffusion models

March 2023

These notes focus on diffusion-based generative models, like the celebrated Denoising Diffusion Probabilistic Models; the material was presented as a series of lectures I gave at some working groups of mathematicians, so the style is tailored for this audience. In particular, everything is fitted into the continuous-time framework (which is not how it is done in practice).

Special attention is given to the differences between ODE sampling and SDE sampling. The analysis of the time evolution of the densities $p_t$ is done using only Fokker-Planck Equations or Transport Equations.

The problem

Let $p$ be a probability density on $\mathbb{R}^d$. The goal of generative modelling is twofold: given samples $x^1, \dotsc, x^n$ from $p$, we want to

  1. learn $p$

  2. generate new samples from $p$.

There are various methods for tackling these challenges: Energy-Based Models, Normalizing Flows and the famous Neural ODEs, vanilla Score-Matching. However, each method has its limitations. For example, EBMs are very challenging to train, NFs lack expressivity and SM fails to capture multimodal distributions. Diffusion models offer sufficient flexibility to (partially) overcome these limitations.

Stochastic interpolation

Diffusion models fall into the general framework of stochastic interpolants. The central idea is to continuously transform the density $p$ into another easy-to-sample density $\pi$ (often called the target), while also transforming the samples $x^i$ from $p$ into samples from $\pi$; and then to reverse the process: that is, to generate a sample from $\pi$ and invert the transformation to get a new sample from $p$. In other words, we seek a path $(p_t: t\in [0,T])$ with $p_0=p$ and $p_T=\pi$, such that generating samples $x_t \sim p_t$ is doable.

The success of diffusion models came from the realization that some stochastic processes, such as Ornstein-Uhlenbeck processes that connect $p_0$ with a distribution $p_T$ very close to pure noise $\mathscr{N}(0,I)$, can be reversed when the score function $\nabla \log p_t$ is available at each time $t$. Although unknown, this score can efficiently be learnt using statistical procedures called score matching.

Original formulation: Gaussian noising process and its inversion

Let $(t,x)\to f_t(x)$ and $t\to \sigma_t$ be two smooth functions. Consider the stochastic differential equation

$$\begin{aligned}& dX_t = f_t(X_t)dt + \sqrt{2\sigma_t ^2}dB_t, \\ & X_0 \sim p\end{aligned} \tag{1}$$

where $dB_t$ denotes integration with respect to a Brownian motion. Under mild conditions on $f$, an almost-surely continuous stochastic process satisfying this SDE exists. Let $p_t$ be the probability density of $X_t$; it is known that this process can be reversed in time. More precisely, the SDE

$$\begin{aligned} & dY_t = -\left( f_{T-t}(Y_t) - 2\sigma_{T-t}^2 \nabla \log p_{T-t}(Y_t) \right)dt + \sqrt{2\sigma_{T-t}^2}dB_t \\ & Y_0 \sim p_T \end{aligned} \tag{2}$$

has the same marginals as $X_t$ reversed in time: more precisely, $Y_t$ has the same distribution as $X_{T-t}$, with density $p_{T-t}$. This inversion needs access to $\nabla \log p_t$, and we'll explain later how this can be done.

For simple functions $f$, the process (1) has an explicit representation. Here we focus on the case where $f_t(x) = -\alpha_t x$ for some function $\alpha$, that is

$$ dX_t = -\alpha_t X_t dt + \sqrt{2\sigma_t^2}dB_t. \tag{3}$$

Define $\mu_t = \int_0^t \alpha_s ds$. Then, the solution of (3) is given by the following stochastic process:

$$ X_t = e^{-\mu_t}X_0 + \sqrt{2}\int_0^t e^{\mu_s-\mu_t} \sigma_s dB_s. \tag{4}$$

In particular, the second term reduces to a Wiener integral; it is a centered Gaussian with variance $2\int_0^t e^{2(\mu_s-\mu_t)}\sigma_s^2 ds$, hence

$$ X_t \stackrel{\mathrm{law}}{=} e^{-\mu_t}X_0 + \mathscr{N}\left(0, 2\int_0^t e^{2\mu_s - 2\mu_t}\sigma_s^2 ds\right). \tag{5}$$

In the pure Ornstein-Uhlenbeck case where $\sigma_t = \sigma$ and $\alpha_t = 1$, we get $\mu_t = t$ and $X_t = e^{-t}X_0 + \mathscr{N}(0,\sigma^2(1 - e^{-2t}))$.

Proof of (4). We set $F(t,x) = xe^{\mu_t}$ and $Y_t = F(t,X_t) = X_t e^{\mu_t}$; it turns out that $Y_t$ satisfies a nicer SDE. Since $\Delta_x F = 0$, $\partial_t F(t,x) = xe^{\mu_t}\alpha_t$ and $\nabla_x F(t,x) = e^{\mu_t}$, Itô's formula (with diffusion coefficient $\sqrt{2\sigma_t^2}$) says that

$$\begin{aligned}dY_t &= \partial_tF(t,X_t)dt + \nabla_x F(t,X_t)dX_t + \sigma_t^2\Delta_x F(t,X_t)dt \\ &= X_te^{\mu_t}\alpha_tdt + e^{\mu_t} dX_t \\ &= \sqrt{2\sigma_t^2 e^{2\mu_t}}dB_t. \end{aligned}$$

Consequently, $Y_t = Y_0 + \int_0^t \sqrt{2\sigma_s^2e^{2\mu_s}}dB_s$, and multiplying by $e^{-\mu_t}$ gives (4).

A consequence of the preceding result is that when the variance

$$\bar{\sigma}_t^2 = 2\int_0^t e^{2\mu_s - 2\mu_t}\sigma_s^2 ds$$

is big compared to $e^{-\mu_t}$, then the distribution of $X_t$ is well-approximated by $\mathscr{N}(0,\bar{\sigma}_t^2)$. Indeed, for $\sigma_t = 1$ and $\alpha_t = 1$, we have $\bar{\sigma}_T = \sqrt{1 - e^{-2T}} \approx 1$ and $e^{-\mu_T} = e^{-T} \approx 0$ if $T$ is sufficiently large.
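For schedules without a closed form, $\mu_t$ and $\bar{\sigma}_t$ can be approximated by quadrature. The following is only a minimal sketch: the helper name `mu_and_sigma_bar`, the grid size and the trapezoidal rule are our own illustrative choices, not part of the original lectures. It is checked against the Ornstein-Uhlenbeck case $\alpha_t = \sigma_t = 1$, where $\mu_t = t$ and $\bar{\sigma}_t^2 = 1 - e^{-2t}$.

```python
import numpy as np

def mu_and_sigma_bar(alpha, sigma, t, n_grid=20001):
    """Return (mu_t, sigma_bar_t) by trapezoidal quadrature on [0, t].

    `alpha` and `sigma` are vectorised callables s -> alpha_s, sigma_s.
    """
    s = np.linspace(0.0, t, n_grid)
    ds = s[1] - s[0]
    a = alpha(s)
    # Cumulative trapezoid: mu[k] approximates int_0^{s_k} alpha_u du.
    mu = np.concatenate(([0.0], np.cumsum(0.5 * (a[1:] + a[:-1]) * ds)))
    # Integrand of sigma_bar_t^2 / 2, namely exp(2 mu_s - 2 mu_t) sigma_s^2.
    integrand = np.exp(2.0 * (mu - mu[-1])) * sigma(s) ** 2
    var = 2.0 * np.sum(0.5 * (integrand[1:] + integrand[:-1]) * ds)
    return mu[-1], np.sqrt(var)

# Ornstein-Uhlenbeck schedule: alpha_t = sigma_t = 1.
ou_alpha = lambda s: np.ones_like(s)
ou_sigma = lambda s: np.ones_like(s)
mu_T, sigma_bar_T = mu_and_sigma_bar(ou_alpha, ou_sigma, t=3.0)
```

At $T = 3$ the mean factor $e^{-\mu_T}$ is already below $0.05$ while $\bar{\sigma}_T$ is within $0.2\%$ of $1$, which is the regime described above.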

The Fokker-Planck point of view

It has recently been recognized that the Ornstein-Uhlenbeck representation of $p_t$ as in (1), as well as the stochastic process (2) that has the same marginals as $p_t$, are not necessarily unique or special. Instead, what matters are two key features: (i) $p_t$ provides a path connecting $p$ and $p_T\approx \mathscr{N}(0,I)$, and (ii) its marginals are easy to sample. There are many other processes besides (1) that have $p_t$ as their marginals, and that can also be reversed. The crucial point is that $p_t$ is a solution of the Fokker-Planck equation:

$$ \partial_t p_t(x) = \Delta (\sigma_t^2 p_t(x)) - \nabla \cdot (f_t(x)p_t(x)). \tag{8}$$

Just to settle the notations once and for all: $\nabla$ is the gradient, and for a function $\rho : \mathbb{R}^d \to \mathbb{R}^d$, $\nabla \cdot \rho(x)$ stands for the divergence, that is $\sum_{i=1}^d \partial_{x_i} \rho_i(x_1, \dotsc, x_d)$, and $\nabla \cdot \nabla = \Delta = \sum_{i=1}^d \partial^2_{x_i}$ is the Laplacian.

Importantly, equation (8) can be recast as a transport equation: with a velocity field defined as

$$v_t(x) = \sigma_t^2 \nabla \log p_t(x) - f_t(x),$$

the equation (8) is equivalent to

$$ \partial_t p_t(x) = \nabla \cdot (v_t(x)p_t(x)). \tag{10}$$

Proof. Since $p_t \nabla \log p_t = \nabla p_t$, we have $\nabla \cdot (v_t(x)p_t(x)) = \sigma_t^2\nabla\cdot (p_t(x)\nabla \log p_t(x)) - \nabla\cdot (f_t(x)p_t(x))= \sigma_t^2 \Delta p_t(x) - \nabla\cdot (f_t(x)p_t(x))$, which is the right-hand side of (8).

An associated ODE

Transport equations like (10) come from simple ODEs; that is, there is a deterministic process with the same marginals as (1).

Let $x(t)$ be the solution of the differential equation with random initial condition

$$x'(t) = -v_t(x(t))\qquad \qquad x(0) =X_0. \tag{11}$$

Then the probability density of $x(t)$ satisfies (10), hence it is equal to $p_t$.

Proof. Let $p_t$ be the probability density of $x(t)$ and let $\varphi$ be any smooth, compactly supported test function. Then, $\mathbb{E}[\varphi(x(t))] = \int p_t(x)\varphi(x)dx$, so by differentiating under the integral,

$$\begin{aligned}\int \partial_t p_t(x)\varphi(x)dx = \partial_t \mathbb{E}[\varphi(x(t))]&= \mathbb{E}[\nabla\varphi(x(t))\cdot x'(t)]\\ &= -\int \nabla \varphi(x)\cdot v_t(x)p_t(x)dx = \int \varphi(x) \nabla \cdot (v_t(x)p_t(x))dx \end{aligned}$$

where the last equality uses the multidimensional integration by parts formula. Since $\varphi$ is arbitrary, $p_t$ satisfies (10).

Up to now, we proved that there are two continuous random processes having the same marginal probability density at time $t$: a smooth one provided by $x(t)$, the solution of the ODE, and a continuous but not differentiable one, $X_t$, provided by the solution of the SDE.
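This equality of marginals can be checked numerically in a case where the score is known in closed form. The sketch below assumes a one-dimensional Gaussian data law $p_0 = \mathscr{N}(m_0, s_0^2)$ and the Ornstein-Uhlenbeck dynamics $\alpha_t = \sigma_t = 1$, so that $p_t = \mathscr{N}(m_0e^{-t}, s_0^2e^{-2t} + 1 - e^{-2t})$; the values of `m0`, `s0`, the step count and the Euler discretizations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m0, s0 = 2.0, 0.5            # hypothetical data law p_0 = N(m0, s0^2)
T, n_steps, n = 1.0, 1000, 20000
dt = T / n_steps

def mean_var(t):
    """Mean and variance of p_t for the OU dynamics with alpha = sigma = 1."""
    return m0 * np.exp(-t), s0**2 * np.exp(-2 * t) + 1.0 - np.exp(-2 * t)

def score(t, x):
    """Exact score of the Gaussian density p_t."""
    m, v = mean_var(t)
    return -(x - m) / v

x0 = m0 + s0 * rng.standard_normal(n)
x_sde, x_ode = x0.copy(), x0.copy()
for k in range(n_steps):
    t = k * dt
    # SDE: dX = -X dt + sqrt(2) dB, discretised with Euler-Maruyama.
    x_sde = x_sde - x_sde * dt + np.sqrt(2 * dt) * rng.standard_normal(n)
    # ODE: x' = -v_t(x) with v_t(x) = score(t, x) + x, explicit Euler.
    x_ode = x_ode - (score(t, x_ode) + x_ode) * dt

m_T, v_T = mean_var(T)
```

Both particle clouds end up with empirical mean and variance close to `m_T` and `v_T`, although the ODE trajectories are smooth and the SDE trajectories are not.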

Time-reversal of Transport Equations and Fokker-Planck equations

We now have various processes $x(t), X_t$ starting at a density $p_0$ and evolving towards a density $p_T \approx \pi = \mathscr{N}(0,I)$. Can these processes be reversed in time? The answer is yes for both of them. We'll start by reversing their associated equations. From now on, we will denote by $p^{\mathrm{b}}_t$ the time-reversal of $p_t$, that is:

$$p^{\mathrm{b}}_t(x) = p_{T-t}(x).$$

The density $p^{\mathrm{b}}_t$ solves the backward Transport Equation:

$$\partial_t p^{\mathrm{b}}_t(x)= \nabla \cdot (v^{\mathrm{b}}_t(x) p^{\mathrm{b}}_t(x)) \tag{14}$$

where

$$v^{\mathrm{b}}_t(x) = -v_{T-t}(x) = -\sigma_{T-t}^2 \nabla \log p^{\mathrm{b}}_t(x) - \alpha_{T-t} x.$$

The density $p^{\mathrm{b}}_t$ also solves the backward Fokker-Planck Equation:

$$\partial_t p^{\mathrm{b}}_t(x) =\sigma_{T-t}^2 \Delta p^{\mathrm{b}}_t(x) - \nabla \cdot (w_t^{\mathrm{b}}(x)p^{\mathrm{b}}_t(x)) \tag{16}$$

where

$$w^{\mathrm{b}}_t(x) = 2\sigma_{T-t}^2 \nabla \log p^{\mathrm{b}}_t(x) + \alpha_{T-t} x.$$

Proof. Denoting by $\dot{p}_t(x)$ the time derivative of $t\mapsto p_t(x)$ at time $t$, we immediately see that $\partial_t p^{\mathrm{b}}_t(x) = -\dot{p}_{T-t}(x)$ and the rest is a mere verification.

Of course, these two equations are the same, but they represent the time-evolution of the density of two different random processes. As explained before, the Transport version (14) represents the time-evolution of the density of the ODE system

$$\begin{aligned}& y'(t) = -v^{\mathrm{b}}_t(y(t)) \\ & y(0) \sim p_T\end{aligned} \tag{18}$$

while the Fokker-Planck version (16) represents the time-evolution of the SDE system

$$\begin{aligned}&dY_t = w^{\mathrm{b}}_t(Y_t)dt + \sqrt{2\sigma_{T-t}^2}dB_t \\ & Y_0 \sim p_T.\end{aligned} \tag{19}$$

Both processes can be sampled using a range of ODE and SDE solvers, the simplest being the Euler scheme and the Euler-Maruyama scheme. However, this requires access to the functions $v^{\mathrm{b}}_t$ and $w^{\mathrm{b}}_t$, which in turn depend on the unknown score $\nabla \log p_t$. Fortunately, $\nabla \log p_t$ can efficiently be estimated due to two factors.

  1. First: we have samples from $p_t$. Remember that our only information about $p$ is a collection $x^1, \dotsc, x^n$ of samples. But thanks to the representation (5), the points $x^i_t = e^{-\mu_t}x^i + \bar{\sigma}_t \xi^i$ with $\xi^i \sim \mathscr{N}(0,I)$ are samples from $p_t$. They are extremely easy to access, since we only need to generate iid standard Gaussian variables $\xi^i$.

  2. Second: score matching. If $p$ is a probability density and $x^i$ are samples from $p$, estimating $\nabla \log p$ (called the score) has been thoroughly examined and is fairly doable, using a technique known as score matching.

Methods for learning the score

The $L^2$-distance between the scores of two probability densities is often called the Fisher divergence:

$$ \mathrm{fisher}(\rho_1 \mid \rho_2) = \int \rho_1(x)|\nabla\log\rho_1(x) - \nabla\log\rho_2(x)|^2dx.$$

Since our goal is to learn $\nabla\log p_t(x)$, it is natural to choose a parametrized family of functions $s_\theta$ and to optimize $\theta$ so that the divergence

$$\int p_t(x)|\nabla\log p_t(x) - s_\theta(x)|^2dx$$

is as small as possible. However, this optimization problem is intractable as stated, since the integrand involves the unknown score $\nabla \log p_t$ itself. This is where Score Matching techniques come into play.

Vanilla score matching

Let $p$ be a smooth probability density function supported over $\mathbb{R}^d$ and let $X$ be a random variable with density $p$. The following elementary identity is due to Hyvärinen, 2005; it is the basis for score matching estimation in statistics.

Let $s : \mathbb{R}^d \to \mathbb{R}^d$ be any smooth function with sufficiently fast decay at $\infty$, and $X \sim p$. Then,

$$ \mathbb{E}[\vert \nabla \log p(X) - s(X)\vert^2] = c + \mathbb{E}\left[|s(X)|^2 + 2 \nabla \cdot s(X)\right] \tag{22}$$

where $c$ is a constant not depending on $s$.

Proof. We start by expanding the square norm:

$$\begin{aligned}\int p(x)|\nabla \log p(x) - s(x)|^2 dx &= \int p(x)|\nabla \log p(x)|^2 dx + \int p(x)|s(x)|^2 dx - 2\int \nabla \log p(x)\cdot p(x)s(x) dx. \end{aligned}$$

The first term does not depend on $s$; it is our constant $c$. For the last term, we use $\nabla \log p = \nabla p / p$, then the integration-by-parts formula:

$$2\int \nabla \log p(x)\cdot p(x)s(x) dx = 2\int \nabla p(x) \cdot s(x) dx = -2 \int p(x)( \nabla \cdot s(x))dx$$

and the identity is proved.
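As a sanity check of this identity, here is a small Monte Carlo experiment in dimension one. It is only a sketch under our own assumptions: $p = \mathscr{N}(0,1)$, whose score is $-x$, and a hypothetical linear model $s(x) = ax + b$, for which the divergence is simply $a$ and both sides equal $(1+a)^2 + b^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(400_000)      # samples from p = N(0, 1), score(x) = -x
a, b = 0.7, -0.3                      # hypothetical linear model s(x) = a x + b

s = a * x + b
div_s = a                             # in 1D the divergence of s is s'(x) = a
explicit = np.mean((-x - s) ** 2)     # left-hand side: needs the true score
c = np.mean(x ** 2)                   # c = E|score(X)|^2, independent of s
implicit = c + np.mean(s ** 2 + 2 * div_s)   # right-hand side: score-free
```

The quantity `implicit - c` is the part that can be minimized without knowing the score, which is the whole point of the identity.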

Now, (22) is particularly interesting for us. Remember that if we want to reverse (11), we do not really need to estimate $p_t$ but only $\nabla \log p_t$. We do so by approximating it using a parametrized family of functions, say $s_\theta$ (typically, a neural network):

$$ \theta_t \in \argmin_\theta \mathbb{E}[\vert \nabla \log p_t(X_t) - s_{\theta}(X_t)\vert^2] = \argmin_\theta \mathbb{E}[|s_{\theta}(X_t)|^2 + 2 \nabla \cdot (s_{\theta}(X_t))]. \tag{25}$$

How do we empirically optimize (25)?

  1. First, we need not solve this optimization problem for every $t$. We could obviously discretize $[0,T]$ with $t_1, \dots, t_N$ and only solve for $\theta_{t_i}$ independently, but it is actually smarter and cheaper to approximate the whole function $(t,x) \to \nabla \log p_t(x)$ by a single neural network (a U-net, in general). That is, we use a parametrized family $s_\theta(t,x)$. This enforces a form of time-continuity which seems natural. Now, since we want to aggregate the losses at each time, we solve the following problem:

$$\argmin_\theta \int_0^T w(t)\mathbb{E}[|s_{\theta}(t, X_t)|^2 + 2 \nabla \cdot (s_{\theta}(t, X_t))]dt \tag{26}$$

where $w(t)$ is a weighting function (for example, $w(t)$ can be higher for $t\approx 0$, since we don't really care about approximating $p_T$ as precisely as $p_0$).

  2. In the preceding formulation we cannot exactly compute the expectation with respect to $p_t$, but we can approximate it with our samples $x_t^i$. Additionally, we need to approximate the integral; for instance, we can discretize the time steps with $t_0=0 < t_1 < \dots < t_N = T$. Our objective function becomes

$$ \ell(\theta) =\frac{1}{n}\sum_{t \in \{t_0, \dots, t_N\}} w(t)\sum_{i=1}^n \left( |s_{\theta}(t, x_t^i)|^2 + 2 \nabla\cdot(s_{\theta}(t, x_t^i)) \right)$$

which looks computable… except it's not ideal. Suppose we perform a gradient descent on $\theta$. Then at each gradient descent step, we need to evaluate $s_{\theta}$ as well as its divergence, and then compute the gradient in $\theta$ of the divergence in $x$, in other words compute $\nabla_\theta \nabla_x \cdot s_\theta$. In high dimension, this can be too costly.

Denoising Score Matching

Fortunately, there is another way to perform score matching when $p_t$ is the distribution of a random variable with Gaussian noise added, as in our setting. We'll present this result in a fairly abstract setting; we suppose that $p$ is a density function, and $p_g = p*g$ where $g$ is another density. The following result is due to Vincent, 2010.

Denoising Score Matching Objective

Let $s:\mathbb{R}^d \to \mathbb{R}^d$ be a smooth function. Let $X$ be a random variable with density $p$, $\varepsilon$ an independent random variable with density $g$, and $X_\varepsilon = X + \varepsilon$, whose density is $p_g = p * g$. Then,

$$ \mathbb{E}[\vert \nabla \log p_g(X_\varepsilon) - s(X_\varepsilon)\vert^2] = c + \mathbb{E}[|\nabla \log g(\varepsilon) - s(X_\varepsilon)|^2]$$

where $c$ is a constant not depending on $s$.

Proof. By the same computation as for vanilla score matching, we have

$$ \mathbb{E}[\vert \nabla \log p_g(X_\varepsilon) - s(X_\varepsilon)\vert^2] = c + \int p_g(x)|s(x)|^2dx -2\int \nabla p_g(x)\cdot s(x)dx.$$

Now by definition, $p_g(x) = \int p(y)g(x-y)dy$, hence $\nabla p_g(x) = \int p(y)\nabla g(x-y)dy$, and the last term above is equal to

$$\begin{aligned} -2\int \int p(y)\nabla g(x-y)\cdot s(x)dxdy &= -2\int \int p(y)g(x-y)\nabla \log g(x-y)\cdot s(x)dydx\\ &= -2\mathbb{E}[\nabla \log g(\varepsilon)\cdot s(X + \varepsilon)]. \end{aligned}$$

Since $X_\varepsilon = X + \varepsilon$, this last term is equal to $-2\mathbb{E}[\nabla \log g(\varepsilon)\cdot s(X_\varepsilon)]$. But then, upon adding and subtracting the term $\mathbb{E}[|\nabla \log g(\varepsilon)|^2]$, which does not depend on $s$, we get another constant $c'$ such that

$$ \mathbb{E}[\vert \nabla \log p_g(X_\varepsilon) - s(X_\varepsilon)\vert^2] = c' + \mathbb{E}[|\nabla \log g(\varepsilon) - s(X + \varepsilon)|^2].$$

Now, this Denoising Score Matching loss does not involve any computation of a "double gradient" like $\nabla_\theta \nabla_x \cdot s_\theta$.
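To see the denoising objective at work, one can fit a model by plain regression on the targets $\nabla \log g(\varepsilon)$ and compare it with the true score of $p * g$. The sketch below is a toy experiment under our own assumptions: one-dimensional Gaussians $p = \mathscr{N}(0, s_0^2)$ and $g = \mathscr{N}(0,\tau^2)$, and a linear model $s(y) = ky$ fitted by least squares. In this case the score of $p * g$ is exactly $-y/(s_0^2+\tau^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s0, tau = 200_000, 1.5, 0.8        # hypothetical: p = N(0, s0^2), g = N(0, tau^2)
x = s0 * rng.standard_normal(n)       # clean samples X ~ p
eps = tau * rng.standard_normal(n)    # independent noise with density g
x_eps = x + eps                       # noisy samples, with density p * g

target = -eps / tau**2                # grad log g(eps): the denoising target
# Minimise mean |target - k * x_eps|^2 over k (closed-form least squares).
k_dsm = np.sum(target * x_eps) / np.sum(x_eps ** 2)
k_true = -1.0 / (s0**2 + tau**2)      # slope of the true score of p * g
```

The regression never touches the density of $X_\varepsilon$, yet `k_dsm` lands on the slope of its score.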

Let us apply this to our setting. Remember that $p_t$ is the density of $e^{-\mu_t}X_0 + \varepsilon_t$ where $\varepsilon_t \sim \mathscr{N}(0,\bar{\sigma}_t^2)$, hence in this case $g(x) = (2\pi\bar{\sigma}_t^2)^{-d/2}e^{-|x|^2 / 2\bar{\sigma}_t^2}$ and $\nabla \log g(x) = - x / \bar{\sigma}^2_t$. The objective in (26) becomes

$$ \argmin_\theta \int_0^T w(t)\mathbb{E}\left[\left|-\frac{\varepsilon_t}{\bar{\sigma}_t^2} - s_\theta(t, e^{-\mu_t}X_0 + \varepsilon_t) \right|^2\right]dt.$$

This can be further simplified. Indeed, let us slightly change the parametrization and use $r_\theta(t,x) = -\bar{\sigma}_t s_\theta(t,x)$, and write $\varepsilon_t = \bar{\sigma}_t \xi$ with $\xi \sim \mathscr{N}(0,I)$. Factoring $\bar{\sigma}_t^{-1}$ out of the norm, the objective becomes

$$ \argmin_\theta \int_0^T \frac{w(t)}{\bar{\sigma}_t^2}\mathbb{E}\left[\left|\xi - r_\theta(t, e^{-\mu_t}X_0 + \bar{\sigma}_t \xi) \right|^2\right]dt.$$

Intuitively, the neural network $r_\theta$ tries to guess the scaled noise $\xi$ from the observation of $X_t$.

Generative models: training and sampling

Let us wrap everything up in this section.

Training: learning the score

The Denoising Diffusion Score Matching loss

Let $\tau$ be a random time on $[0,T]$ with density proportional to $w(t)$; let $\xi$ be a standard Gaussian random variable. The DDPM theoretical objective is

$$ \ell(\theta) = \mathbb{E}\left[\frac{1}{\bar{\sigma}_\tau^2}\left|\xi - r_\theta(\tau, e^{-\mu_\tau}X_0 + \bar{\sigma}_\tau \xi )\right|^2\right].$$

Since we have access to samples $(x^i, \xi^i, \tau^i)$ (at the cost of generating iid samples $\xi^i$ from a standard Gaussian and $\tau^i$ from the density proportional to $w$ on $[0,T]$, which is the uniform distribution when $w$ is constant), we get the empirical version:

$$\hat{\ell}(\theta) = \frac{1}{n}\sum_{i=1}^n \frac{1}{\bar{\sigma}_{\tau^i}^2}\left|\xi^i - r_\theta(\tau^i, e^{-\mu_{\tau^i}}x^i + \bar{\sigma}_{\tau^i} \xi^i)\right|^2.$$

Up to the constants and the choice of the drift $\alpha_t$ and variance $\sigma_t$, this is exactly the loss function (14) from the DDPM paper, for instance.
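A possible NumPy transcription of this empirical loss is sketched below, under assumptions of our own: the Ornstein-Uhlenbeck schedule $\alpha_t = \sigma_t = 1$ (so $\mu_t = t$ and $\bar{\sigma}_t^2 = 1 - e^{-2t}$), a uniform $\tau$, and the weight $1/\bar{\sigma}_\tau^2$ produced by the substitution $s_\theta = -r_\theta/\bar{\sigma}_t$. The names `empirical_loss`, `r_opt`, `r_zero` and the cutoff `t_min` are hypothetical. For standard Gaussian data, $X_t \sim \mathscr{N}(0,I)$ for all $t$ and the best noise predictor is $r(t,x) = \bar{\sigma}_t x$, which gives something to compare against.

```python
import numpy as np

def sigma_bar(t):
    """Noise level for the OU case alpha_t = sigma_t = 1 (so mu_t = t)."""
    return np.sqrt(1.0 - np.exp(-2.0 * t))

def empirical_loss(r, x0, T, rng, t_min=1e-3):
    """Monte Carlo estimate of the noise-prediction loss with uniform tau.

    `r(tau, x)` maps times of shape (n,) and points of shape (n, d) to (n, d).
    """
    n = x0.shape[0]
    tau = rng.uniform(t_min, T, size=n)   # t_min avoids sigma_bar(0) = 0
    xi = rng.standard_normal(x0.shape)
    sb = sigma_bar(tau)[:, None]
    x_tau = np.exp(-tau)[:, None] * x0 + sb * xi   # sample from p_tau
    residual = xi - r(tau, x_tau)
    return np.mean(np.sum(residual ** 2, axis=1) / sb[:, 0] ** 2)

x0 = np.random.default_rng(0).standard_normal((20000, 2))   # data ~ N(0, I)
r_opt = lambda t, x: sigma_bar(t)[:, None] * x    # E[xi | X_t = x] for this data
r_zero = lambda t, x: np.zeros_like(x)            # trivial predictor
# Same seed for both calls, so both losses use the same (tau, xi) draws.
loss_opt = empirical_loss(r_opt, x0, 3.0, np.random.default_rng(1))
loss_zero = empirical_loss(r_zero, x0, 3.0, np.random.default_rng(1))
```

In an actual training loop `r` would be a neural network and this quantity would be minimized by stochastic gradient descent over minibatches; the sketch only evaluates it.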

In practice, for image generation, the go-to choice for the architecture of $r_\theta$ is a U-net, a special kind of convolutional neural network with a downsampling phase, an upsampling phase, and skip-connections in between.

Sampling

Once the algorithm has converged to $\theta$, we get $s_\theta(t,x)$ which is a proxy for $\nabla \log p_t(x)$. Now, we simply plug this expression in the functions $v^{\mathrm{b}}_t$ if we want to solve the ODE (18) or $w^{\mathrm{b}}_t$ if we want to solve the SDE (19).

The ODE sampler solves $y'(t) = -\hat{v}^{\mathrm{b}}_t(y(t))$ started at $y(0) \sim \mathscr{N}(0,I)$, where $\hat{v}^{\mathrm{b}}_t(x) = -\sigma_{T-t}^2 s_\theta(T-t,x) - \alpha_{T-t} x$.

The SDE sampler solves $dY_t = \hat{w}^{\mathrm{b}}_t(Y_t)dt + \sqrt{2\sigma_{T-t}^2}dB_t$ started at $Y_0 \sim \mathscr{N}(0,I)$, where $\hat{w}^{\mathrm{b}}_t(x) = 2\sigma_{T-t}^2 s_\theta(T-t,x) + \alpha_{T-t} x$.
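Both samplers can be sketched in a few lines with the Euler and Euler-Maruyama schemes. In the toy code below, which rests on our own assumptions, the trained network is replaced by the exact score of a one-dimensional Gaussian data law $p = \mathscr{N}(m_0, s_0^2)$ under $\alpha_t = \sigma_t = 1$; the values of `m0`, `s0`, `T` and the step count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m0, s0 = 2.0, 0.5                  # hypothetical data law p = N(m0, s0^2)
T, n_steps, n = 5.0, 2000, 20000
dt = T / n_steps

def score(t, x):
    """Exact score of p_t, standing in for a trained s_theta(t, x)."""
    m = m0 * np.exp(-t)
    v = s0**2 * np.exp(-2 * t) + 1.0 - np.exp(-2 * t)
    return -(x - m) / v

y_ode = rng.standard_normal(n)     # y(0) ~ N(0, 1)
y_sde = rng.standard_normal(n)     # Y_0 ~ N(0, 1)
for k in range(n_steps):
    s = T - k * dt                 # forward time T - t
    # ODE sampler (Euler): y' = -v_hat^b_t(y) = score(T - t, y) + y
    y_ode = y_ode + (score(s, y_ode) + y_ode) * dt
    # SDE sampler (Euler-Maruyama): dY = (2 score(T - t, Y) + Y) dt + sqrt(2) dB
    noise = np.sqrt(2 * dt) * rng.standard_normal(n)
    y_sde = y_sde + (2 * score(s, y_sde) + y_sde) * dt + noise
```

With an exact score, both clouds `y_ode` and `y_sde` end up distributed approximately as $\mathscr{N}(m_0, s_0^2)$, up to discretization error and the small mismatch between $p_T$ and $\mathscr{N}(0,1)$.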

We must stress a subtle fact. Equations (8) and (10), or their backward counterparts, are exactly the same equation accounting for $p_t$. But since now we replaced $\nabla \log p_t$ by its approximation $s_\theta$, this is no longer the case for our two samplers: their probability densities are not the same. In fact, let us denote by $q^{\mathrm{ode}}_t,q^{\mathrm{sde}}_t$ the densities of $y(t)$ and $Y_{t}$; the first one solves a Transport Equation, the second one a Fokker-Planck equation, and these two equations are different.

Backward Equations for the samplers

$$\partial_t q^{\mathrm{ode}}_t(x) = \nabla \cdot \left(\hat{v}^{\mathrm{b}}_t(x)q^{\mathrm{ode}}_t(x)\right)\qquad \qquad q_0^{\mathrm{ode}} = \pi$$

$$\partial_t q^{\mathrm{sde}}_t(x) = \nabla \cdot \left([\sigma_{T-t}^2\nabla \log q^{\mathrm{sde}}_t(x) - \hat{w}^{\mathrm{b}}_t(x)]q^{\mathrm{sde}}_t(x)\right) \qquad \qquad q_0^{\mathrm{sde}} = \pi \tag{37}$$

Importantly, the velocity $\sigma_{T-t}^2\nabla \log q^{\mathrm{sde}}_t(x) - \hat{w}^{\mathrm{b}}_t(x)$ is in general not equal to the velocity $\hat{v}^{\mathrm{b}}_t(x)$. They would be equal only in the case $s_\theta(t,x) = \nabla \log p_t(x)$.

Proof. Since $y(t)$ solves an ODE, its density directly satisfies the transport equation with velocity $\hat{v}^{\mathrm{b}}_t$. Since $Y_t$ solves an SDE, its density satisfies the Fokker-Planck equation associated with the drift $\hat{w}^{\mathrm{b}}_t$, which in turn can be transformed into the transport equation shown above.
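The gap between the two samplers is easy to exhibit on a toy example of our own choosing. Take $p = \mathscr{N}(0, s_0^2)$ in dimension one with $\alpha_t = \sigma_t = 1$, and a deliberately wrong proxy equal to $c$ times the exact score. Both samplers are then linear, their laws stay centred Gaussian, and their variances solve scalar ODEs that we integrate with explicit Euler; the function name and the parameter values are illustrative.

```python
import numpy as np

def sampler_variances(c, s0_sq=0.25, T=5.0, n_steps=5000):
    """Output variances (ODE sampler, SDE sampler) for the proxy c * exact score.

    Toy case: p = N(0, s0_sq), alpha_t = sigma_t = 1, pi = N(0, 1).
    """
    dt = T / n_steps
    v_ode = v_sde = 1.0                       # both start from pi = N(0, 1)
    for k in range(n_steps):
        s = T - k * dt                        # forward time T - t
        v = s0_sq * np.exp(-2 * s) + 1.0 - np.exp(-2 * s)   # variance of p_s
        # ODE sampler y' = (1 - c / v) y, so Var' = 2 (1 - c / v) Var.
        v_ode += 2.0 * (1.0 - c / v) * v_ode * dt
        # SDE sampler dY = (1 - 2 c / v) Y dt + sqrt(2) dB,
        # so Var' = 2 (1 - 2 c / v) Var + 2.
        v_sde += (2.0 * (1.0 - 2.0 * c / v) * v_sde + 2.0) * dt
    return v_ode, v_sde

exact = sampler_variances(c=1.0)      # exact score: both samplers recover s0_sq
biased = sampler_variances(c=0.9)     # 10% error on the score: outputs differ
```

With `c=1.0` both variances come back to $s_0^2 = 0.25$; with `c=0.9` the two samplers no longer agree, and in this particular example the SDE output stays much closer to $0.25$ than the ODE output.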

Special choices for $\alpha_t$ and $\sigma_t$

Considerable work has been done (mostly experimentally) to find good functions $\alpha_t,\sigma_t$. Some choices seem to stand out.

A variational bound for the SDE sampler

Let $s : [0,T]\times \mathbb{R}^d \to \mathbb{R}^d$ be a smooth function, meant as a proxy for $\nabla \log p_t$. Our goal is to quantify the difference between the sampled densities $q^{\mathrm{ode}}_t, q^{\mathrm{sde}}_t$ and $p^{\mathrm{b}}_t=p_{T-t}$. It turns out that controlling the Fisher divergence $\mathbb{E}[|\nabla \log p_t(X_t) - s(t,X_t)|^2]$ results in a bound for $\mathrm{kl}(p \mid q_T^{\mathrm{sde}})$, but not for $\mathrm{kl}(p \mid q_T^{\mathrm{ode}})$.

Small recap on notations

The true density is $p^{\mathrm{b}}_t = p_{T-t}$; it satisfies the backward equation (14):

$$ \partial_t p^{\mathrm{b}}_t(x) = \nabla \cdot \left(v^{\mathrm{b}}_t(x)p^{\mathrm{b}}_t(x)\right)\qquad \qquad v^{\mathrm{b}}_t(x) = -\sigma_{T-t}^2\nabla \log p^{\mathrm{b}}_t(x) - \alpha_{T-t}x.$$

The density of the generative process is $q^{\mathrm{sde}}_t$, but we'll simply denote it by $q_t$. It satisfies the backward equation (37)

$$\partial_t q_t(x) = \nabla\cdot \left(u_t(x)q_t(x)\right)$$

where

$$ u_t(x) = \sigma_{T-t}^2\nabla \log q_t(x) - 2\sigma_{T-t}^2s(t,x) - \alpha_{T-t}x.$$

With a slight abuse of notation, in the backward equations below $s(t,x)$ stands for the proxy evaluated at forward time $T-t$, so that it is compared with $\nabla \log p^{\mathrm{b}}_t(x)$.

The original distribution we want to sample is $p = p_0 = p^{\mathrm{b}}_T$, and the output distribution of our SDE sampler is $q^{\mathrm{sde}}_T = q_T$. Finally, the distribution $p_T = p_0^{\mathrm{b}}$ is approximated with $\pi$ (in practice, $\mathscr{N}(0,I)$).

The KL divergence between densities $\rho_1, \rho_2$ is

$$ \mathrm{kl}(\rho_1 \mid \rho_2) = \int \rho_1(x)\log(\rho_1(x)/ \rho_2(x))dx.$$

A variational lower-bound

This theorem restricts to the case where the weights $w(t)$ are constant, and for simplicity, they are set to 1.

Variational lower-bound for score-based diffusion models with SDE sampler

$$ \mathrm{kl}(p \mid q_T^{\mathrm{sde}}) \leqslant \mathrm{kl}(p_T \mid \pi) +\int_0^T \sigma^2_{t} \mathbb{E}[ |\nabla \log p_t(X_t) - s(t,X_t)\vert^2 ] dt. \tag{42}$$

The original proof can be found in this paper and uses the Girsanov theorem applied to the SDE representations (1)-(2) of the forward/backward process. This is utterly complicated and is too dependent on the SDE representation. The proof presented below only needs the Fokker-Planck equation and is done directly at the level of probability densities.

The following lemma is interesting on its own since it gives an exact expression for the KL divergence between transport equations.

$$\frac{d}{dt}\mathrm{kl}(p^{\mathrm{b}}_t \mid q_t) = \int p^{\mathrm{b}}_t(x) \nabla \log\left(\frac{p^{\mathrm{b}}_t(x)}{q_t(x)}\right) \cdot \left(u_t(x)- v^{\mathrm{b}}_t(x) \right)dx \tag{43}$$

In our case with the specific shape assumed by $u_t, v^{\mathrm{b}}_t$, we get the following bound:

$$\frac{d}{dt}\mathrm{kl}(p^{\mathrm{b}}_t \mid q_t) \leqslant \sigma^2_{T-t}\int p^{\mathrm{b}}_t(x) |s(t,x) - \nabla \log p^{\mathrm{b}}_t(x) |^2 dx \tag{44}$$

The proofs of (42)-(43)-(44) are only based on elementary manipulations of time-evolution equations.

Proof of (43).

A small differentiation shows that $\frac{d}{dt}\mathrm{kl}(p^{\mathrm{b}}_t \mid q_t)$ is equal to

$$\int \nabla \cdot (v^{\mathrm{b}}_t(x)p^{\mathrm{b}}_t(x))\log(p^{\mathrm{b}}_t(x)/q_t(x))dx + \int p^{\mathrm{b}}_t(x)\frac{\nabla \cdot (v^{\mathrm{b}}_t(x)p^{\mathrm{b}}_t(x))}{p^{\mathrm{b}}_t(x)}dx - \int p^{\mathrm{b}}_t(x)\frac{\nabla \cdot (u_t(x)q_t(x))}{q_t(x)} dx.$$

By an integration by parts, the first term is also equal to $-\int p^{\mathrm{b}}_t(x)v^{\mathrm{b}}_t(x)\cdot \nabla \log(p^{\mathrm{b}}_t(x)/q_t(x))dx$. The second term is the integral of a divergence, so it is zero. Finally, for the last one,

$$\begin{aligned} - \int p^{\mathrm{b}}_t(x)\frac{\nabla \cdot (u_t(x)q_t(x))}{q_t(x)} dx &= \int \nabla (p^{\mathrm{b}}_t(x)/q_t(x)) \cdot u_t(x)q_t(x)dx \\ &= \int \nabla \log(p^{\mathrm{b}}_t(x)/q_t(x))\cdot u_t(x)p^{\mathrm{b}}_t(x)dx. \end{aligned}$$

Proof of (44). We recall that

$$u_t(x) = \sigma^2_{T-t}\nabla \log q_t(x) - 2\sigma^2_{T-t}s(t,x) - \alpha_{T-t}x$$

and

$$v^{\mathrm{b}}_t(x) = -\sigma^2_{T-t}\nabla \log p^{\mathrm{b}}_t(x) - \alpha_{T-t}x,$$

so that

$$\begin{aligned} u_t - v^{\mathrm{b}}_t &= \sigma^2_{T-t}\nabla \log q_t - 2\sigma^2_{T-t}s + \sigma^2_{T-t}\nabla \log p^{\mathrm{b}}_t\\ &= \sigma^2_{T-t} \left( \nabla \log q_t - \nabla \log p^{\mathrm{b}}_t + 2 (\nabla \log p^{\mathrm{b}}_t - s) \right).\end{aligned}$$

We momentarily note $a = \nabla \log p^{\mathrm{b}}_t(x)$, $b = \nabla \log q_t(x)$ and $s=s(t,x)$. Then, (43) shows that

$$\begin{aligned} \frac{d}{dt}\mathrm{kl}(p^{\mathrm{b}}_t \mid q_t) &= \sigma^2_{T-t}\int p^{\mathrm{b}}_t(x)(a - b)\cdot ((b-a) + 2(a - s))dx\\ &= - \sigma^2_{T-t}\int p^{\mathrm{b}}_t(x)|a-b|^2 dx + 2 \sigma^2_{T-t}\int p^{\mathrm{b}}_t(x)(a-b)\cdot (a-s)dx. \end{aligned}$$

We now use the classical inequality $2(x\cdot y) \leqslant |x|^2 + |y|^2$ with $x = a-b$ and $y = a-s$; the terms in $|a-b|^2$ cancel and we get

$$ \frac{d}{dt}\mathrm{kl}(p^{\mathrm{b}}_t \mid q_t) \leqslant \sigma^2_{T-t} \int p^{\mathrm{b}}_t(x)|s(t,x) - \nabla\log p^{\mathrm{b}}_t(x)|^2dx.$$

Proof of (42).

Now, we simply write

$$\begin{aligned} \mathrm{kl}(p^{\mathrm{b}}_T \mid q^{\mathrm{sde}}_T) - \mathrm{kl}(p^{\mathrm{b}}_0 \mid q_0^{\mathrm{sde}}) &= \int_0^T \frac{d}{dt}\mathrm{kl}(p^{\mathrm{b}}_t \mid q_t) dt \end{aligned} \tag{52}$$

and plug (44) inside the RHS. Here $q_0 = \pi$ and $p^{\mathrm{b}}_T= p$, hence the result.
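The bound can be tested numerically on a toy case where every term is explicit. The sketch below relies on assumptions of our own: $p = \mathscr{N}(0, s_0^2)$ in dimension one, $\alpha_t = \sigma_t = 1$, $\pi = \mathscr{N}(0,1)$, and a proxy equal to $c$ times the exact score. All densities are then centred Gaussians, the KL divergence between $\mathscr{N}(0,a)$ and $\mathscr{N}(0,b)$ is $\frac12(a/b - 1 - \log(a/b))$, and the Fisher term at time $t$ equals $(1-c)^2/\mathrm{Var}(X_t)$.

```python
import numpy as np

c, s0_sq, T, n_steps = 0.9, 0.25, 5.0, 5000   # hypothetical toy setting
dt = T / n_steps

def var_p(s):
    """Variance of p_s when p = N(0, s0_sq) and alpha_t = sigma_t = 1."""
    return s0_sq * np.exp(-2 * s) + 1.0 - np.exp(-2 * s)

def kl_gauss(a, b):
    """kl(N(0, a) | N(0, b)) for scalar variances a and b."""
    return 0.5 * (a / b - 1.0 - np.log(a / b))

v_q, fisher_integral = 1.0, 0.0               # q_0 = pi = N(0, 1)
for k in range(n_steps):
    s = T - k * dt                            # forward time T - t
    v = var_p(s)
    # Variance of the SDE sampler driven by the proxy -c x / v (explicit Euler).
    v_q += (2.0 * (1.0 - 2.0 * c / v) * v_q + 2.0) * dt
    # sigma_s^2 * E|grad log p_s(X_s) - proxy|^2 = (1 - c)^2 / v.
    fisher_integral += (1.0 - c) ** 2 / v * dt

lhs = kl_gauss(s0_sq, v_q)                         # kl(p | q_T^sde)
rhs = kl_gauss(var_p(T), 1.0) + fisher_integral    # kl(p_T | pi) + Fisher term
```

In this example `lhs` stays below `rhs`, as the theorem predicts; the check is of course no substitute for the proof.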

What about the ODE?

It turns out that the ODE sampler, whose density is $q^{\mathrm{ode}}_t$ (simply noted $q_t$ in this paragraph), does not have such a nice upper bound. In fact, since $q^{\mathrm{ode}}_t$ solves a Transport Equation, we can still use (43) but with $u_t$ replaced with $\hat{v}^{\mathrm{b}}_t$, and integrate in $t$ just as in (52). We have

$$\begin{aligned}\hat{v}^{\mathrm{b}}_t(x) - v^{\mathrm{b}}_t(x) &= \sigma^2_{T-t}\left(\nabla \log p^{\mathrm{b}}_t(x)-s(t,x)\right) \\ &= \sigma^2_{T-t}\left(\nabla \log p^{\mathrm{b}}_t(x) - \nabla\log q_t(x) + \nabla\log q_t(x) - s(t,x)\right). \end{aligned}$$

Using the Cauchy-Schwarz inequality on the cross term, together with $2ab \leqslant a^2 + b^2$, we obtain the following upper bound:

$$\mathrm{kl}(p \mid q_T^{\mathrm{ode}}) - \mathrm{kl}(p_T \mid \pi) \leqslant \int_0^T \sigma^2_{T-t}\int p^{\mathrm{b}}_t(x)\left(\frac{3}{2}\left|\nabla\log p^{\mathrm{b}}_t(x) - \nabla\log q_t(x)\right|^2 + \frac{1}{2}\left|\nabla \log q_t(x) - s(t,x)\right|^2\right) dx dt.$$

There is a significant difference between this bound and the SDE version (42). Minimizing the score matching objective does minimize the upper bound for the SDE sampler, but not the upper bound for the ODE sampler, which also involves the uncontrolled score $\nabla \log q_t$ of the sampler itself. This disparity is due to the Fisher divergence, which does not provide control over the KL divergence between the solutions of two transport equations. However, it does control the KL divergence between the solutions of the associated Fokker-Planck equations, thanks to the presence of a diffusive term. This could be one of the reasons for the lower performance of ODE solvers that was observed by early experimenters in the field. However, more recent works (see the references just below) seem to challenge this idea. With different dynamics than the Ornstein-Uhlenbeck one, deterministic sampling techniques like ODEs now seem to outperform the stochastic ones. A complete understanding of these phenomena is not available yet; the outstanding paper on stochastic interpolants proposes a remarkable framework towards this task (and inspired most of the analysis in this note).

References

On diffusion models

The original paper on diffusion models

DDPM (seminal paper for image generation)

Diffusion beat GANs (pushing diffusions well beyond the SOTA)

Variational perspective on Diffusions or arxiv (the analytical SDE approach)

Maximum likelihood training of Diffusions (proofs of the variational lower-bound)

Sampling is as easy as learning the score (theoretical analysis under minimal assumptions)

Beyond diffusions

Diffusion Schrodinger Bridge

Probability flow for FP

Flow matching paper

Stochastic interpolants

Consistency models

Rectified Flow