This series of notes focuses on diffusion-based generative models, like the celebrated Denoising Diffusion Probabilistic Models; it contains the material I regularly present as lectures in working groups for mathematicians or math graduate students, so the style is tailored to this audience. In particular, everything is fitted into the continuous-time framework (which is not exactly how it is done in practice).
Special attention is given to the differences between ODE sampling and SDE sampling. The analysis of the time evolution of the densities $p_t$ is done using only Fokker-Planck equations and transport equations.
Let $p$ be a probability density on $\mathbb{R}^d$. The goal of generative modelling is twofold: given samples $x^1,\dots,x^n$ from $p$, we want to estimate $p$ and generate new samples from $p$.
Many methods were designed to tackle these challenges: Energy-Based Models, Normalizing Flows and the famous Neural ODEs, vanilla Score-Matching, GANs. Each has its limitations: for example, EBMs are challenging to train, NFs lack expressivity, and SM fails to capture multimodal distributions. Diffusion models, and their successors flow-based models, offer enough flexibility to partially overcome these limitations. Their ability to be guided towards conditional generation using Classifier-Free Guidance is also a major advantage, and will be reviewed in the third note of the series.
Diffusion models fall into the framework of stochastic interpolants. The idea is to continuously transform the density $p$ into another, easy-to-sample density $\pi$ (often called the target), while also transforming the samples $x^i$ from $p$ into samples from $\pi$; and then to reverse the process: that is, to generate a sample from $\pi$ and to invert the transformation to get a new sample from $p$. In other words, we seek a path $(p_t : t\in[0,T])$ with $p_0 = p$ and $p_T = \pi$, such that generating samples $x_t\sim p_t$ is doable.
The success of diffusion models came from the realization that some stochastic processes, such as Ornstein-Uhlenbeck processes, which connect $p_0$ with a distribution $p_T$ very close to pure noise $N(0,I)$, can be reversed when the score function $\nabla\log p_t$ is available at each time $t$. Although unknown, this score can efficiently be learnt using statistical procedures called score matching.
Let $(t,x)\mapsto f_t(x)$ and $t\mapsto w_t$ be two smooth functions. Consider the stochastic differential equation
$$\mathrm{d}X_t = f_t(X_t)\,\mathrm{d}t + \sqrt{2w_t^2}\,\mathrm{d}B_t, \qquad X_0\sim p, \tag{1}$$
where $\mathrm{d}B_t$ denotes integration with respect to a Brownian motion. Under mild conditions on $f$, an almost-surely continuous stochastic process satisfying this SDE exists. Let $p_t$ be the probability density of $X_t$; it is a non-trivial fact that this process can be reversed in time. More precisely, the following SDE is exactly the time-reversal of (1):
$$\mathrm{d}Y_t = \big(2w_{T-t}^2\,\nabla\log p_{T-t}(Y_t) - f_{T-t}(Y_t)\big)\,\mathrm{d}t + \sqrt{2w_{T-t}^2}\,\mathrm{d}B_t, \qquad Y_0\sim p_T. \tag{2}$$
Theorem (Anderson, 1982). The law of the stochastic process $(Y_t)_{t\in[0,T]}$ is the same as the law of $(X_{T-t})_{t\in[0,T]}$.
While this theorem is not easy to prove, we will later check using the Fokker-Planck equation that the marginals $q_t$ of the process $Y_t$ are indeed the same as the marginals $p_{T-t}$ of the process $(X_{T-t})$.
Sampling paths from the reverse SDE (2) requires access to $\nabla\log p_t$. The key point of diffusion-like methods is that this quantity can be estimated.
For simple functions $f$, the process (1) has an explicit representation. Here we focus on the case where $f_t(x) = -\mu_t x$ for some function $\mu$ called the drift coefficient, that is,
$$\mathrm{d}X_t = -\mu_t X_t\,\mathrm{d}t + \sqrt{2w_t^2}\,\mathrm{d}B_t. \tag{3}$$
Define $\alpha_t = e^{-A_t}$ where $A_t = \int_0^t \mu_s\,\mathrm{d}s$. Then the solution of (3) is given by the following stochastic process:
$$X_t = \alpha_t X_0 + \sqrt{2}\int_0^t e^{A_s - A_t}\, w_s\,\mathrm{d}B_s. \tag{4}$$
In particular, setting
$$\bar\sigma_t^2 = 2\int_0^t e^{-2\int_s^t \mu_u\,\mathrm{d}u}\, w_s^2\,\mathrm{d}s,$$
we have
$$X_t \overset{\text{law}}{=} \alpha_t X_0 + \bar\sigma_t\,\varepsilon \tag{8}$$
where $\varepsilon\sim N(0,I)$.
In the pure Ornstein-Uhlenbeck case where $w_t = 1$ and $\mu_t = 1$, we get $\alpha_t = e^{-t}$ and $X_t \overset{\text{law}}{=} e^{-t}X_0 + N(0, 1-e^{-2t})$.
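Indeed, with $\mu_t = w_t = 1$ the variance formula reduces to an elementary integral:
$$\bar\sigma_t^2 = 2\int_0^t e^{-2(t-s)}\,\mathrm{d}s = \left[e^{-2(t-s)}\right]_{s=0}^{s=t} = 1 - e^{-2t}.$$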
Proof of (4). We set $F(x,t) = x e^{A_t}$ and $Y_t = F(X_t,t) = X_t e^{A_t}$; it turns out that $Y_t$ satisfies a nicer SDE. Since $\Delta_x F = 0$, $\partial_t F(x,t) = x e^{A_t}\mu_t$ and $\nabla_x F(x,t) = e^{A_t}$, Itô's formula says that
$$\mathrm{d}Y_t = \partial_t F(X_t,t)\,\mathrm{d}t + \nabla_x F(X_t,t)\cdot\mathrm{d}X_t + \frac{2w_t^2}{2}\Delta_x F(X_t,t)\,\mathrm{d}t = X_t e^{A_t}\mu_t\,\mathrm{d}t + e^{A_t}\,\mathrm{d}X_t = \sqrt{2w_t^2 e^{2A_t}}\,\mathrm{d}B_t.$$
Consequently, $Y_t = Y_0 + \int_0^t \sqrt{2w_s^2 e^{2A_s}}\,\mathrm{d}B_s$ and the result follows upon multiplying everything by $e^{-A_t}$.
The second term in (4) is a Wiener integral; it is a centered Gaussian with variance $2\int_0^t e^{2(A_s-A_t)} w_s^2\,\mathrm{d}s$, hence
$$X_t \overset{\text{law}}{=} e^{-A_t}X_0 + N\Big(0,\; 2\int_0^t e^{2A_s - 2A_t}\, w_s^2\,\mathrm{d}s\Big).$$
A consequence of the preceding result is that when the variance $\bar\sigma_t^2$ is large compared to $\alpha_t$, the distribution of $X_t$ is well-approximated by $N(0,\bar\sigma_t^2)$. Indeed, for $\mu_t = w_t = 1$, we have $\alpha_T = e^{-T}$ and $\bar\sigma_T^2 = 1-e^{-2T}\approx 1$ if $T$ is sufficiently large, say $T > 10$.
It has recently been recognized that the Ornstein-Uhlenbeck representation (1) of $p_t$, as well as the reverse process (2), are not necessarily unique or special. Instead, what matters are two key features: (i) $p_t$ provides a path connecting $p$ and $p_T\approx N(0,I)$, and (ii) its marginals are easy to sample. There are other processes besides (1) that have $p_t$ as their marginals, and that can also be reversed. The crucial point is that $p_t$ is a solution of the Fokker-Planck equation:
$$\partial_t p_t(x) = \Delta\big(w_t^2\, p_t(x)\big) - \nabla\cdot\big(f_t(x)\, p_t(x)\big). \tag{9}$$
Just to settle the notation once and for all: $\nabla$ is the gradient, and for a function $\rho:\mathbb{R}^d\to\mathbb{R}^d$, $\nabla\cdot\rho(x)$ stands for the divergence, that is $\sum_{i=1}^d \partial_{x_i}\rho_i(x_1,\dots,x_d)$, and $\nabla\cdot\nabla = \Delta = \sum_{i=1}^d \partial_{x_i}^2$ is the Laplacian.
Proof (informal). For a compactly supported smooth test function $\varphi$, we have $\partial_t E[\varphi(X_t)] = \int \varphi(x)\,\partial_t p_t(x)\,\mathrm{d}x$. On the other hand, Itô's formula says that $\mathrm{d}\varphi(X_t) = \nabla\varphi(X_t)\cdot\mathrm{d}X_t + \frac{2w_t^2}{2}\Delta\varphi(X_t)\,\mathrm{d}t$, which is also equal to $\nabla\varphi(X_t)\cdot f_t(X_t)\,\mathrm{d}t + \mathrm{d}M_t + w_t^2\,\Delta\varphi(X_t)\,\mathrm{d}t$, where $M_t$ is a Brownian martingale started at $0$, whose expectation is thus $0$. Gathering everything, we see that $\partial_t E[\varphi(X_t)]$ is also equal to $E\big[\nabla\varphi(X_t)\cdot f_t(X_t) + w_t^2\,\Delta\varphi(X_t)\big]$, that is,
$$\int \nabla\varphi(x)\cdot f_t(x)\, p_t(x)\,\mathrm{d}x + \int \Delta\varphi(x)\, w_t^2\, p_t(x)\,\mathrm{d}x.$$
One integration by parts on the first integral, and two on the second, lead to the expression
$$\int \varphi(x)\left[-\nabla\cdot\big(p_t(x) f_t(x)\big) + w_t^2\,\Delta p_t(x)\right]\mathrm{d}x.$$
Comparing this with the first expression for $\partial_t E[\varphi(X_t)]$ gives the result.
Importantly, equation (9) can be recast as a transport equation: with a velocity field defined as
$$v_t(x) = w_t^2\,\nabla\log p_t(x) - f_t(x),$$
the density $p_t$ solves
$$\partial_t p_t(x) = \nabla\cdot\big(v_t(x)\, p_t(x)\big). \tag{13}$$
Proof. We have
$$\nabla\cdot\big(v_t(x)p_t(x)\big) = \nabla\cdot\big(w_t^2\,\nabla\log p_t(x)\, p_t(x)\big) - \nabla\cdot\big(f_t(x)p_t(x)\big) = w_t^2\,\nabla\cdot\nabla p_t(x) - \nabla\cdot\big(f_t(x)p_t(x)\big)$$
(using $p_t\nabla\log p_t = \nabla p_t$), and since $\nabla\cdot\nabla = \Delta$, this is equal to $w_t^2\,\Delta p_t(x) - \nabla\cdot(f_t(x)p_t(x)) = \partial_t p_t(x)$.
Equations like (13) are called transport equations or continuity equations. They come from simple ODEs; that is, there is a deterministic process with the same marginals as (1).
Let $x(t)$ be the solution of the differential equation with random initial condition
$$x'(t) = -v_t(x(t)), \qquad x(0) = X_0.$$
Then the probability density of $x(t)$ satisfies (13); hence it is equal to $p_t$.
Proof. Let $\rho_t$ be the probability density of $x(t)$ and let $\varphi$ be any smooth, compactly supported test function. Then $E[\varphi(x(t))] = \int \rho_t(x)\varphi(x)\,\mathrm{d}x$, so by differentiation under the integral sign,
$$\int \partial_t\rho_t(x)\,\varphi(x)\,\mathrm{d}x = \partial_t E[\varphi(x(t))] = E[\nabla\varphi(x(t))\cdot x'(t)] = -\int\nabla\varphi(x)\cdot v_t(x)\,\rho_t(x)\,\mathrm{d}x = \int\varphi(x)\,\nabla\cdot\big(v_t(x)\rho_t(x)\big)\,\mathrm{d}x,$$
where the last equality uses the multidimensional integration by parts formula. Hence $\rho_t$ satisfies (13), as claimed.
Up to now, we have proved that there are two continuous random processes having the same marginal probability density $p_t$ at each time $t$: a smooth one provided by $x(t)$, the solution of the ODE, and a continuous but non-differentiable one, $X_t$, provided by the solution of the SDE.
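This can be checked numerically in a toy one-dimensional Gaussian setting, where the score $\nabla\log p_t$ is available in closed form. The sketch below (the initial law $N(2, 0.5^2)$, the horizon $T=3$ and the schedule $\mu_t = w_t = 1$ are arbitrary choices for the demo) integrates the ODE with the Euler scheme and the SDE with the Euler-Maruyama scheme, and compares the resulting means and variances:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_steps, n = 3.0, 1000, 50_000
h = T / n_steps
m0, s0 = 2.0, 0.5  # illustrative initial law: X_0 ~ N(m0, s0^2)

def score(t, x):
    # Exact score of p_t = N(m0 e^{-t}, s0^2 e^{-2t} + 1 - e^{-2t})
    # for the OU dynamics with mu_t = w_t = 1.
    m = m0 * np.exp(-t)
    v = s0**2 * np.exp(-2 * t) + 1 - np.exp(-2 * t)
    return -(x - m) / v

x_ode = m0 + s0 * rng.standard_normal(n)  # particles for the ODE
x_sde = x_ode.copy()                      # particles for the SDE

for k in range(n_steps):
    t = k * h
    # Euler step for x'(t) = -v_t(x), with v_t(x) = score(t, x) + x
    x_ode -= h * (score(t, x_ode) + x_ode)
    # Euler-Maruyama step for dX_t = -X_t dt + sqrt(2) dB_t
    x_sde += -x_sde * h + np.sqrt(2 * h) * rng.standard_normal(n)

# Both clouds should match p_T = N(m0 e^{-T}, s0^2 e^{-2T} + 1 - e^{-2T})
print(x_ode.mean(), x_ode.var())
print(x_sde.mean(), x_sde.var())
```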
We now have various processes $x(t), X_t$ starting at a density $p_0$ and evolving towards a density $p_T\approx\pi = N(0,I)$. Can these processes be reversed in time? The answer is yes for both of them. We'll start by reversing their associated equations. From now on, we will denote by $p_t^b$ the time-reversal of $p_t$, that is:
$$p_t^b(x) = p_{T-t}(x).$$
The density $p_t^b$ solves the backward transport equation
$$\partial_t p_t^b(x) = \nabla\cdot\big(v_t^b(x)\, p_t^b(x)\big) \tag{17}$$
where
$$v_t^b(x) = -v_{T-t}(x) = -w_{T-t}^2\,\nabla\log p_t^b(x) - \mu_{T-t}\,x.$$
The density $p_t^b$ also solves the backward Fokker-Planck equation
$$\partial_t p_t^b(x) = w_{T-t}^2\,\Delta p_t^b(x) - \nabla\cdot\big(u_t^b(x)\, p_t^b(x)\big) \tag{19}$$
where
$$u_t^b(x) = 2w_{T-t}^2\,\nabla\log p_t^b(x) + \mu_{T-t}\,x.$$
Proof. Writing $\dot p_t(x)$ for the time derivative of $t\mapsto p_t(x)$, we immediately see that $\partial_t p_t^b(x) = -\dot p_{T-t}(x)$, and the rest is a mere verification.
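For completeness, the verification runs as follows. For the transport version,
$$\partial_t p_t^b(x) = -\dot p_{T-t}(x) = -\nabla\cdot\big(v_{T-t}(x)\,p_{T-t}(x)\big) = \nabla\cdot\big(v_t^b(x)\,p_t^b(x)\big)$$
since $v_t^b = -v_{T-t}$. For the Fokker-Planck version, using $w_{T-t}^2\,\Delta p_t^b = \nabla\cdot\big(w_{T-t}^2\,\nabla\log p_t^b(x)\,p_t^b(x)\big)$,
$$w_{T-t}^2\,\Delta p_t^b(x) - \nabla\cdot\big(u_t^b(x)\,p_t^b(x)\big) = \nabla\cdot\Big(\big(w_{T-t}^2\,\nabla\log p_t^b(x) - u_t^b(x)\big)\,p_t^b(x)\Big) = \nabla\cdot\big(v_t^b(x)\,p_t^b(x)\big),$$
because $w_{T-t}^2\,\nabla\log p_t^b - u_t^b = -w_{T-t}^2\,\nabla\log p_t^b - \mu_{T-t}\,x = v_t^b$.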
Of course, these two equations are the same, but they represent the time-evolution of the density of two different random processes. As explained before, the transport version (17) represents the time-evolution of the density of the ODE system
$$y'(t) = -v_t^b(y(t)), \qquad y(0)\sim p_T, \tag{21}$$
while the Fokker-Planck version (19) represents the time-evolution of the SDE system
$$\mathrm{d}Y_t = u_t^b(Y_t)\,\mathrm{d}t + \sqrt{2w_{T-t}^2}\,\mathrm{d}B_t, \qquad Y_0\sim p_T. \tag{22}$$
Both of these processes can be sampled using a range of ODE and SDE solvers, the simplest of which are the Euler scheme and the Euler-Maruyama scheme, recalled below. However, this requires access to the functions $v_t^b$ and $u_t^b$, which in turn depend on the unknown score $\nabla\log p_t$. Fortunately, $\nabla\log p_t$ can efficiently be estimated thanks to two factors.
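Concretely, with a step size $h = T/N$ and discretization times $t_k = kh$, these two schemes read
$$y_{k+1} = y_k - h\, v_{t_k}^b(y_k), \qquad\qquad Y_{k+1} = Y_k + h\, u_{t_k}^b(Y_k) + \sqrt{2\,w_{T-t_k}^2\, h}\;\xi_k, \qquad \xi_k\sim N(0,I)\ \text{iid}.$$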
First: we have samples from $p_t$. Remember that our only information about $p$ is a collection $x^1,\dots,x^n$ of samples. Thanks to the representation (8), the points $x_t^i = \alpha_t x^i + \bar\sigma_t\,\xi^i$ with $\xi^i\sim N(0,I)$ are samples from $p_t$. They are extremely easy to access, since we only need to generate iid standard Gaussian variables $\xi^i$.
Second: score matching. If $p$ is a probability density and the $x^i$ are samples from $p$, estimating $\nabla\log p$ (called the score) has been thoroughly examined and is fairly doable, using a technique known as score matching.
Learning the score $\nabla_x\log p(x)$ of a probability density $p$ is a well-known problem in statistics, and is somehow orthogonal to the world of generative flow models. I gathered the main ideas in the next note on flow matching and Tweedie's formula. In short, it turns out that training a neural network $s(t,x)$ to denoise $X_t$ (that is, to remove the added noise $\varepsilon_t$, where $X_t = \alpha_t X_0 + \varepsilon_t$) with the loss $E[|s(t,X_t) - \varepsilon_t|^2]$ directly leads to an estimator of the score, since the minimizer satisfies $s^\star(t,x) = -\bar\sigma_t^2\,\nabla\log p_t(x)$.
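To fix ideas, here is a minimal sketch of this training procedure in PyTorch, under illustrative assumptions: the Ornstein-Uhlenbeck schedule $\mu_t = w_t = 1$ (so $\alpha_t = e^{-t}$ and $\bar\sigma_t = \sqrt{1-e^{-2t}}$), a toy MLP in place of a real architecture, placeholder data, and the common reparametrization where the network predicts the unit noise $\varepsilon$ rather than $\varepsilon_t = \bar\sigma_t\varepsilon$:

```python
import torch
from torch import nn

T, d = 5.0, 2
net = nn.Sequential(nn.Linear(d + 1, 128), nn.SiLU(),
                    nn.Linear(128, 128), nn.SiLU(),
                    nn.Linear(128, d))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def loss_fn(x0):
    """Denoising loss E|s(t, X_t) - eps|^2 with X_t = alpha_t X_0 + sigma_t eps."""
    t = torch.rand(x0.shape[0], 1) * T
    alpha = torch.exp(-t)
    sigma = torch.sqrt(1 - torch.exp(-2 * t))
    eps = torch.randn_like(x0)
    xt = alpha * x0 + sigma * eps            # a sample from p_t given X_0 = x0
    pred = net(torch.cat([xt, t], dim=1))    # s_theta(t, x_t)
    return ((pred - eps) ** 2).mean()

for step in range(5_000):
    x0 = torch.randn(256, d) * 0.5 + 2.0     # placeholder: real samples x^i go here
    opt.zero_grad()
    loss_fn(x0).backward()
    opt.step()

# With this parametrization, grad log p_t(x) is estimated by -net([x, t]) / sigma_t.
```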
In practice, for image generation, the go-to choice for the architecture of $s_\theta$ was first a U-net, a special kind of convolutional neural network with a downsampling phase, an upsampling phase, and skip-connections in between. After 2023, it seems that everyone switched to pure-transformer models, following the landmark DiT paper by Peebles and Xie.
Once the algorithm has converged to $\theta$, we get $s_\theta(t,x)$, which is a proxy for $\nabla\log p_t(x)$ (we absorbed the constant $\bar\sigma_t^2$ into the definition of $s$). Now, we simply plug this proxy into the function $v_t^b$ if we want to solve the ODE (21), or $u_t^b$ if we want to solve the SDE (22).
The ODE sampler (DDIM) solves $y'(t) = -\hat v_t^b(y(t))$ started at $y(0)\sim N(0,I)$, where
$$\hat v_t^b(x) = -w_{T-t}^2\, s_\theta(T-t, x) - \mu_{T-t}\,x.$$
The SDE sampler (DDPM) solves $\mathrm{d}Y_t = \hat u_t^b(Y_t)\,\mathrm{d}t + \sqrt{2w_{T-t}^2}\,\mathrm{d}B_t$ started at $Y_0\sim N(0,I)$, where
$$\hat u_t^b(x) = 2w_{T-t}^2\, s_\theta(T-t, x) + \mu_{T-t}\,x.$$
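Continuing the toy PyTorch sketch above (OU schedule $\mu_t = w_t = 1$, noise-predicting network `net`, horizon `T`, dimension `d`), both samplers take a few lines. The small cutoff near the final time is a practical assumption: $\bar\sigma_{T-t}\to 0$ there, which makes the plug-in score blow up numerically.

```python
import torch

def score_hat(t, x):
    """Plug-in score s_theta(t, x), estimating grad log p_t(x) from the noise net."""
    sigma = torch.sqrt(1 - torch.exp(-2 * t))
    return -net(torch.cat([x, t.expand(x.shape[0], 1)], dim=1)) / sigma

@torch.no_grad()
def sample(n=1000, n_steps=500, stochastic=True, cutoff=1e-3):
    h = (T - cutoff) / n_steps
    y = torch.randn(n, d)                 # y(0) ~ N(0, I), approximating p_T
    for k in range(n_steps):
        tau = torch.tensor(T - k * h)     # forward time T - t at backward time t = k h
        s = score_hat(tau, y)
        if stochastic:   # DDPM-like: Euler-Maruyama on the backward SDE (22)
            y = y + h * (2 * s + y) + (2 * h) ** 0.5 * torch.randn_like(y)
        else:            # DDIM-like: Euler on the backward ODE (21)
            y = y + h * (s + y)
    return y
```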
We must stress a subtle fact. Equations (9) and (13), or their backward counterparts, are exactly the same equation accounting for $p_t$. But now that we have replaced $\nabla\log p_t$ by its approximation $s_\theta$, this is no longer the case for our two samplers: their probability densities are not the same. In fact, let us denote by $q_t^{\mathrm{ode}}, q_t^{\mathrm{sde}}$ the densities of $y(t)$ and $Y_t$; the first one solves a transport equation, the second one a Fokker-Planck equation, and these two equations are different.
Backward equations for the samplers:
$$\partial_t q_t^{\mathrm{ode}}(x) = \nabla\cdot\big(\hat v_t^b(x)\, q_t^{\mathrm{ode}}(x)\big), \qquad q_0^{\mathrm{ode}} = \pi,$$
$$\partial_t q_t^{\mathrm{sde}}(x) = \nabla\cdot\Big(\big[w_{T-t}^2\,\nabla\log q_t^{\mathrm{sde}}(x) - \hat u_t^b(x)\big]\, q_t^{\mathrm{sde}}(x)\Big), \qquad q_0^{\mathrm{sde}} = \pi.$$
Importantly, the velocity $w_{T-t}^2\,\nabla\log q_t^{\mathrm{sde}}(x) - \hat u_t^b(x)$ is in general not equal to the velocity $\hat v_t^b(x)$. They would be equal only if $s_\theta(t,x) = \nabla\log p_t(x)$.
Proof. Since $y(t)$ solves an ODE, its density directly satisfies the transport equation with velocity $\hat v_t^b$. Since $Y_t$ solves an SDE, its density satisfies the Fokker-Planck equation associated with the drift $\hat u_t^b$, which in turn can be transformed into the transport equation shown above.
We showed that the solution of (3) at time $t$ has the same distribution as $\alpha_t X_0 + \bar\sigma_t\,\varepsilon$ where $\varepsilon\sim N(0,I)$. Here, the coefficients $\alpha_t, \bar\sigma_t$ are related to $\mu_t, w_t$ by
$$\alpha_t = \exp\Big\{-\int_0^t \mu_s\,\mathrm{d}s\Big\}, \qquad \bar\sigma_t^2 = 2\int_0^t e^{-2\int_s^t\mu_u\,\mathrm{d}u}\, w_s^2\,\mathrm{d}s.$$
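For exploratory purposes, these integrals can be computed by simple numerical quadrature rather than by hand. A minimal numpy sketch (the helper name `alpha_sigma2` and the crude left-endpoint rule are illustrative only):

```python
import numpy as np

def alpha_sigma2(mu, w, t, n=100_000):
    """Numerically compute alpha_t and sigma_t^2 for vectorized schedules mu, w."""
    s = np.linspace(0.0, t, n)
    ds = t / (n - 1)
    A = np.cumsum(mu(s)) * ds            # A_s = int_0^s mu_u du (left-endpoint rule)
    alpha = np.exp(-A[-1])
    integrand = np.exp(-2.0 * (A[-1] - A)) * w(s) ** 2
    sigma2 = 2.0 * np.sum(integrand) * ds
    return alpha, sigma2

# Sanity check on the Ornstein-Uhlenbeck case mu = w = 1:
# expect alpha = e^{-t} and sigma^2 = 1 - e^{-2t}.
one = lambda s: np.ones_like(s)
print(alpha_sigma2(one, one, 2.0))       # approx. (0.1353, 0.9817)
```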
Considerable work has been done (mostly experimentally) to find good functions $\mu_t, w_t$. Some choices stand out.
The VE (variance exploding) path takes $\mu_t = 0$ (that is, no drift) and $w_t$ a continuous, increasing function over $[0,1)$ such that $\bar\sigma_0 = 0$ and $\bar\sigma_1 = +\infty$; typically, $w_t = (1-t)^{-1}$. This gives the parameters
$$\alpha_t = 1, \qquad \bar\sigma_t^2 = 2\int_0^t w_s^2\,\mathrm{d}s = \frac{2t}{1-t}.$$
The VP (variance preserving) path takes $w_t^2 = \mu_t$. In this case, $\alpha_t = e^{-\int_0^t\mu_s\,\mathrm{d}s}$ and
$$\bar\sigma_t^2 = 2\int_0^t e^{-2\int_s^t\mu_u\,\mathrm{d}u}\,\mu_s\,\mathrm{d}s = 1 - e^{-2\int_0^t\mu_s\,\mathrm{d}s},$$
the last equality holding because $2\mu_s\, e^{-2\int_s^t\mu_u\,\mathrm{d}u}$ is exactly the $s$-derivative of $e^{-2\int_s^t\mu_u\,\mathrm{d}u}$.
The name « variance preserving » comes from the fact that the element-wise variance of $X_t = \alpha_t X_0 + \bar\sigma_t\,\varepsilon$ is exactly $\alpha_t^2 + \bar\sigma_t^2$, which in this case is equal to $1$ (we supposed, without loss of generality, that $X_0$ had been standardized to have element-wise variance $1$).
The design choices for a diffusion reduce to the drift and diffusion coefficients $\mu_t, w_t$; these choices then determine the coefficients $\alpha_t, \bar\sigma_t$. But there might be a way to directly choose $\alpha_t, \bar\sigma_t$ and specify paths having distribution $\alpha_t X_0 + \bar\sigma_t\,\varepsilon$. Typically, we would like to choose
$$\alpha_t = \cos(\pi t/2), \qquad \bar\sigma_t = \sin(\pi t/2).$$
Of course, we could find $\mu_t, w_t$ solving these equations, but this is awkward. This decoupling is done through stochastic interpolation and will be reviewed in the third note of the series.
Let $s : [0,T]\times\mathbb{R}^d\to\mathbb{R}^d$ be a smooth function, meant as a proxy for $\nabla\log p_t$. Our goal is to quantify the difference between the sampled densities $q_t^{\mathrm{ode}}, q_t^{\mathrm{sde}}$ and $p_t^b = p_{T-t}$. It turns out that controlling the Fisher divergence $E[|\nabla\log p_t(X) - s(t,X)|^2]$ results in a bound for $\mathrm{kl}(p \mid q_T^{\mathrm{sde}})$, but not for $\mathrm{kl}(p \mid q_T^{\mathrm{ode}})$.
The density of the generative process is $q_t^{\mathrm{sde}}$, but we will simply write $q_t$. It satisfies the backward equation
$$\partial_t q_t(x) = \nabla\cdot\big(u_t(x)\, q_t(x)\big) \tag{25}$$
where
$$u_t(x) = w_{T-t}^2\,\nabla\log q_t(x) - 2w_{T-t}^2\, s(t,x) - \mu_{T-t}\,x.$$
The original distribution we want to sample from is $p = p_0 = p_T^b$, and the output distribution of our SDE sampler is $q_T^{\mathrm{sde}} = q_T$. Finally, the distribution $p_T = p_0^b$ is approximated by $\pi$ (in practice, $N(0,I)$).
The original proof can be found in this paper and uses the Girsanov theorem applied to the SDE representations (1)-(2) of the forward/backward processes. This is utterly complicated and too dependent on the SDE representation. The proof presented below only needs the Fokker-Planck equation and works directly at the level of probability densities.
The following lemma is interesting on its own, since it gives an exact expression for the evolution of the KL divergence between solutions of two transport equations.
Lemma. Let $p_t^b$ and $q_t$ solve the two transport equations $\partial_t p_t^b(x) = \nabla\cdot(v_t^b(x)\,p_t^b(x))$ and $\partial_t q_t(x) = \nabla\cdot(u_t(x)\,q_t(x))$. Then
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathrm{kl}(p_t^b \mid q_t) = \int p_t^b(x)\,\big(u_t(x) - v_t^b(x)\big)\cdot\nabla\log\frac{p_t^b(x)}{q_t(x)}\,\mathrm{d}x. \tag{38}$$

Proof. Differentiating $\mathrm{kl}(p_t^b\mid q_t) = \int p_t^b(x)\log(p_t^b(x)/q_t(x))\,\mathrm{d}x$ gives three terms:
$$\int \partial_t p_t^b(x)\,\log\frac{p_t^b(x)}{q_t(x)}\,\mathrm{d}x + \int \partial_t p_t^b(x)\,\mathrm{d}x - \int \frac{p_t^b(x)}{q_t(x)}\,\partial_t q_t(x)\,\mathrm{d}x.$$
By an integration by parts, the first term is also equal to $-\int p_t^b(x)\, v_t^b(x)\cdot\nabla\log(p_t^b(x)/q_t(x))\,\mathrm{d}x$. The second term is clearly zero, since $\int p_t^b = 1$ at all times. Finally, for the last one,
$$-\int \frac{p_t^b(x)}{q_t(x)}\,\nabla\cdot\big(u_t(x)q_t(x)\big)\,\mathrm{d}x = \int \nabla\big(p_t^b(x)/q_t(x)\big)\cdot u_t(x)\, q_t(x)\,\mathrm{d}x = \int \nabla\log\big(p_t^b(x)/q_t(x)\big)\cdot u_t(x)\, p_t^b(x)\,\mathrm{d}x.$$
Gathering the three terms gives (38).
To make (38) explicit, recall that $v_t^b(x) = -w_{T-t}^2\,\nabla\log p_t^b(x) - \mu_{T-t}\,x$, so that
$$u_t - v_t^b = w_{T-t}^2\,\nabla\log q_t - 2w_{T-t}^2\, s + w_{T-t}^2\,\nabla\log p_t^b = w_{T-t}^2\big(\nabla\log q_t - \nabla\log p_t^b + 2(\nabla\log p_t^b - s)\big).$$
We momentarily write $a = \nabla\log p_t^b(x)$, $b = \nabla\log q_t(x)$ and $s = s(t,x)$. Then, (38) shows that
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathrm{kl}(p_t^b\mid q_t) = w_{T-t}^2\int p_t^b(x)\,(a-b)\cdot\big((b-a) + 2(a-s)\big)\,\mathrm{d}x = -w_{T-t}^2\int p_t^b(x)\,|a-b|^2\,\mathrm{d}x + 2w_{T-t}^2\int p_t^b(x)\,(a-b)\cdot(a-s)\,\mathrm{d}x.$$
We now use the classical inequality $2\,x\cdot y \leqslant |x|^2 + |y|^2$; we get
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathrm{kl}(p_t^b\mid q_t) \leqslant w_{T-t}^2\int p_t^b(x)\,|a-s|^2\,\mathrm{d}x = w_{T-t}^2\, E\big[|\nabla\log p_t^b(X) - s(t,X)|^2\big], \qquad X\sim p_t^b,$$
and integrating between $0$ and $T$,
$$\mathrm{kl}(p\mid q_T) \leqslant \mathrm{kl}(p_T\mid \pi) + \int_0^T w_{T-t}^2\, E\big[|\nabla\log p_t^b(X) - s(t,X)|^2\big]\,\mathrm{d}t. \tag{47}$$
It turns out that the ODE sampler, whose density is $q_t^{\mathrm{ode}}$, does not enjoy such a nice upper bound. Indeed, since $q_t^{\mathrm{ode}}$ solves a transport equation, we can still use (38), but with $u_t$ replaced by $\hat v_t^b$, and integrate in $t$ just as in (47). We have $\hat v_t^b - v_t^b = w_{T-t}^2\,(a - s)$, hence
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathrm{kl}(p_t^b\mid q_t^{\mathrm{ode}}) = w_{T-t}^2\int p_t^b(x)\,(a-b)\cdot(a-s)\,\mathrm{d}x,$$
where now $b = \nabla\log q_t^{\mathrm{ode}}(x)$. The negative term $-w_{T-t}^2\int p_t^b\,|a-b|^2$ is no longer there to absorb the factor $(a-b)$, which depends on the unknown density $q_t^{\mathrm{ode}}$: the Fisher divergence alone does not control the right-hand side.
There is a significant difference between the ODE and SDE samplers: minimizing the score-matching objective also minimizes the upper bound (47) on the sampling error of the SDE sampler, while no such bound holds for the ODE sampler. This disparity comes from the fact that the Fisher divergence does not provide control over the KL divergence between the solutions of two transport equations; it does, however, control the KL divergence between the solutions of the associated Fokker-Planck equations, thanks to the presence of a diffusive term. This could be one of the reasons for the lower performance of ODE solvers observed by early experimenters in the field. However, more recent works (see the references just below) seem to challenge this idea. With dynamics different from the Ornstein-Uhlenbeck one, deterministic sampling techniques like ODEs now seem to outperform stochastic ones. A complete understanding of these phenomena is not yet available; the outstanding paper on stochastic interpolants proposes a remarkable framework towards this task (and inspired most of the analysis in this note).