These notes focus on diffusion-based generative models, like the celebrated Denoising Diffusion Probabilistic Models; the material was presented as a series of lectures I gave at some working groups of mathematicians, so the style is tailored for this audience. In particular, everything is fitted into the continuous-time framework (which is not how it is done in practice).
Special attention is given to the differences between ODE sampling and SDE sampling. The analysis of the time evolution of the densities is done using only Fokker-Planck Equations or Transport Equations.
Let $p$ be a probability density on $\mathbb{R}^d$. The goal of generative modelling is twofold: given samples $x^1, \dots, x^n$ from $p$, we want to estimate $p$ and to generate new samples from $p$.
There are various methods for tackling these challenges: Energy-Based Models, Normalizing Flows and the famous Neural ODEs, vanilla Score-Matching. However, each method has its limitations. For example, EBMs are very challenging to train, NFs lack expressivity and SM fails to capture multimodal distributions. Diffusion models offer sufficient flexibility to (partially) overcome these limitations.
Diffusion models fall into the general framework of stochastic interpolants. The central idea is to continuously transform the density $p$ into another easy-to-sample density $\pi$ (often called the target), while also transforming the samples $x^i$ from $p$ into samples from $\pi$; and then, to reverse the process: that is, to generate a sample from $\pi$, and to invert the transformation to get a new sample from $p$. In other words, we seek a path $(p_t : t \in [0,T])$ with $p_0 = p$ and $p_T = \pi$, such that generating samples $x_t \sim p_t$ is doable.
The success of diffusion models came from the realization that some stochastic processes, such as Ornstein-Uhlenbeck processes that connect $p_0$ with a distribution $p_T$ very close to pure noise $N(0, I)$, can be reversed when the score function $\nabla \log p_t$ is available at each time $t$. Although unknown, this score can efficiently be learnt using statistical procedures called score matching.
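Let $(t,x) \mapsto f_t(x)$ and $t \mapsto \sigma_t$ be two smooth functions. Consider the stochastic differential equation
$$dX_t = f_t(X_t)\,dt + \sqrt{2\sigma_t^2}\,dB_t, \qquad X_0 \sim p \tag{1}$$
where $dB_t$ denotes integration with respect to a Brownian motion. Under mild conditions on $f$, an almost-surely continuous stochastic process satisfying this SDE exists. Let $p_t$ be the probability density of $X_t$; it is known that this process can be reversed in time. More precisely, the SDE
$$dY_t = -\left(f_{T-t}(Y_t) - 2\sigma_{T-t}^2 \nabla \log p_{T-t}(Y_t)\right)dt + \sqrt{2\sigma_{T-t}^2}\,dB_t, \qquad Y_0 \sim p_T \tag{2}$$
has the same marginals as $X_t$ reversed in time: more precisely, $Y_t$ has the same distribution as $X_{T-t}$, whose density is $p_{T-t}$. This inversion needs access to $\nabla \log p_t$, and we'll explain later how this can be done.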
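For simple functions $f$, the process (1) has an explicit representation. Here we focus on the case where $f_t(x) = -\alpha_t x$ for some function $\alpha$, that is
$$dX_t = -\alpha_t X_t\,dt + \sqrt{2\sigma_t^2}\,dB_t. \tag{3}$$
Define $\mu_t = \int_0^t \alpha_s\,ds$. Then, the solution of (3) is given by the following stochastic process:
$$X_t = e^{-\mu_t}X_0 + \sqrt{2}\int_0^t e^{\mu_s - \mu_t}\sigma_s\,dB_s. \tag{4}$$
In particular, the second term reduces to a Wiener Integral; it is a centered Gaussian with variance $\bar\sigma_t^2 = 2\int_0^t e^{2(\mu_s - \mu_t)}\sigma_s^2\,ds$, hence
$$X_t \overset{\mathrm{law}}{=} e^{-\mu_t}X_0 + \bar\sigma_t\,\xi, \qquad \xi \sim N(0, I) \text{ independent of } X_0. \tag{5}$$
In the pure Ornstein-Uhlenbeck case where $\sigma_t = \sigma$ and $\alpha_t = 1$, we get $\mu_t = t$ and $\bar\sigma_t^2 = \sigma^2(1 - e^{-2t})$.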
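Proof of (4). Set $F(t,x) = e^{\mu_t}x$ and $Z_t = F(t, X_t) = e^{\mu_t}X_t$; it turns out that $Z_t$ satisfies a nicer SDE. Since $\partial_t F(t,x) = \alpha_t e^{\mu_t}x$, $\nabla_x F(t,x) = e^{\mu_t}$ and $\Delta_x F = 0$, Itô's formula gives
$$dZ_t = \alpha_t e^{\mu_t}X_t\,dt + e^{\mu_t}\,dX_t = e^{\mu_t}\sqrt{2\sigma_t^2}\,dB_t,$$
so that $e^{\mu_t}X_t = X_0 + \sqrt{2}\int_0^t e^{\mu_s}\sigma_s\,dB_s$, and the result holds.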
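A consequence of the preceding result is that when the variance $\bar\sigma_t^2$ is big compared to $e^{-\mu_t}$, then the distribution of $X_t$ is well-approximated by $N(0, \bar\sigma_t^2 I)$. Indeed, for $\alpha_t = \sigma_t = 1$, we have $e^{-\mu_T} = e^{-T} \approx 0$ and $\bar\sigma_T^2 = 1 - e^{-2T} \approx 1$ if $T$ is sufficiently large.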
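The closed-form law of the Ornstein-Uhlenbeck process can be compared with a direct simulation of the SDE. The following is a minimal sketch, assuming the pure Ornstein-Uhlenbeck case ($\alpha_t = 1$, constant $\sigma$), so that $X_t$ is $e^{-t}X_0$ plus a centered Gaussian of variance $\sigma^2(1 - e^{-2t})$; the function names and the bimodal toy data are illustrative choices, not part of the notes.

```python
import numpy as np

def ou_closed_form(x0, t, sigma, rng):
    """Sample X_t given X_0 = x0 for dX = -X dt + sqrt(2 sigma^2) dB."""
    std = sigma * np.sqrt(1.0 - np.exp(-2.0 * t))  # standard deviation of the Gaussian part
    return np.exp(-t) * x0 + std * rng.standard_normal(x0.shape)

def ou_euler_maruyama(x0, t, sigma, n_steps, rng):
    """Simulate the same SDE with the Euler-Maruyama scheme."""
    dt = t / n_steps
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - x * dt + np.sqrt(2.0 * sigma**2 * dt) * noise
    return x

rng = np.random.default_rng(0)
x0 = rng.choice([-2.0, 2.0], size=50_000)  # bimodal toy "data" with variance 4
exact = ou_closed_form(x0, t=1.0, sigma=1.0, rng=rng)
approx = ou_euler_maruyama(x0, t=1.0, sigma=1.0, n_steps=200, rng=rng)

# Both variances should be close to 4 e^{-2} + (1 - e^{-2}).
target_var = 4.0 * np.exp(-2.0) + 1.0 - np.exp(-2.0)
print(exact.var(), approx.var(), target_var)
```

In practice only the closed form is used during training, since it gives a sample of $X_t$ at any time without simulating the path.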
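It has recently been recognized that the Ornstein-Uhlenbeck representation of $p_t$ as in (1), as well as the stochastic process (2) that has the same marginals as $p_t$, are not necessarily unique or special. Instead, what matters are two key features: (i) $p_t$ provides a path connecting $p_0 = p$ and $p_T \approx N(0, I)$, and (ii) its marginals are easy to sample. There are many other processes besides (1) that have $p_t$ as their marginals, and that can also be reversed. The crucial point is that $p_t$ is a solution of the Fokker-Planck equation:
$$\partial_t p_t(x) = \sigma_t^2 \Delta p_t(x) - \nabla \cdot (f_t(x) p_t(x)). \tag{8}$$
Just to settle the notations once and for all: $\nabla$ is the gradient, and for a function $\rho : \mathbb{R}^d \to \mathbb{R}^d$, $\nabla \cdot \rho(x)$ stands for the divergence, that is $\sum_{i=1}^d \partial_{x_i}\rho_i(x_1, \dots, x_d)$, and $\Delta = \nabla \cdot \nabla$ is the Laplacian.
Importantly, equation (8) can be recast as a transport equation: with a velocity field defined as
$$v_t(x) = \sigma_t^2 \nabla \log p_t(x) - f_t(x),$$
the equation (8) is equivalent to
$$\partial_t p_t(x) = \nabla \cdot (v_t(x) p_t(x)). \tag{10}$$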
Transport equations like (10) come from simple ODEs; that is, there is a deterministic process with the same marginals as (1).
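Let $x(t)$ be the solution of the differential equation with random initial condition
$$x'(t) = -v_t(x(t)), \qquad x(0) = X_0 \sim p. \tag{11}$$
Then the probability density of $x(t)$ satisfies (10), hence it is equal to $p_t$.
Proof. Let $\rho_t$ be the probability density of $x(t)$ and let $\varphi$ be any smooth, compactly supported test function. Then, $\mathbb{E}[\varphi(x(t))] = \int \rho_t(x)\varphi(x)\,dx$, so by derivation under the integral,
$$\begin{aligned} \int \partial_t \rho_t(x)\varphi(x)\,dx &= \frac{d}{dt}\mathbb{E}[\varphi(x(t))] \\ &= \mathbb{E}[\nabla\varphi(x(t)) \cdot x'(t)] \\ &= -\int \nabla\varphi(x) \cdot v_t(x)\rho_t(x)\,dx \\ &= \int \varphi(x)\,\nabla \cdot (v_t(x)\rho_t(x))\,dx \end{aligned}$$
where the last line uses the multidimensional integration by parts formula. Since $\varphi$ is arbitrary, $\rho_t$ solves (10) with the same initial condition as $p_t$.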
Up to now, we proved that there are two continuous random processes having the same marginal probability density $p_t$ at time $t$: a smooth one provided by $x(t)$, the solution of the ODE, and a continuous but not differentiable one, $X_t$, provided by the solution of the SDE.
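The equality of marginals between the ODE and the SDE can be checked numerically whenever the score is known. Here is a small sketch under strong assumptions: one-dimensional Gaussian data $N(m_0, s_0^2)$ and pure Ornstein-Uhlenbeck dynamics ($f_t(x) = -x$, constant $\sigma$), in which case $p_t$ is Gaussian with explicit mean and variance. The constants and function names are hypothetical.

```python
import numpy as np

M0, S0, SIGMA = 3.0, 0.5, 1.0  # toy data N(M0, S0^2) and noise level (assumed values)

def mean_var(t):
    """Mean and variance of p_t for dX = -X dt + sqrt(2 SIGMA^2) dB."""
    decay = np.exp(-2.0 * t)
    return M0 * np.exp(-t), S0**2 * decay + SIGMA**2 * (1.0 - decay)

def score(t, x):
    """Exact score of the Gaussian density p_t."""
    m, v = mean_var(t)
    return -(x - m) / v

def velocity(t, x):
    """v_t = sigma^2 * score - f_t, with f_t(x) = -x."""
    return SIGMA**2 * score(t, x) + x

def flow_ode(x0, t_end, n_steps):
    """Euler scheme for x'(t) = -v_t(x(t))."""
    dt = t_end / n_steps
    x = x0.copy()
    for k in range(n_steps):
        x = x - velocity(k * dt, x) * dt
    return x

rng = np.random.default_rng(0)
x0 = M0 + S0 * rng.standard_normal(100_000)
x_end = flow_ode(x0, t_end=2.0, n_steps=2000)
m_end, v_end = mean_var(2.0)
print(x_end.mean(), m_end, x_end.var(), v_end)
```

The particles follow smooth deterministic trajectories, yet their empirical mean and variance at the final time match those of the noisy process.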
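We now have various processes starting at a density $p_0$ and evolving towards a density $p_T$. Can these processes be reversed in time? The answer is yes for both of them. We'll start by reversing their associated equations. From now on, we will note $q_t$ the time-reversal of $p_t$, that is:
$$q_t(x) = p_{T-t}(x).$$
The density $q_t$ solves the backward Transport Equation:
$$\partial_t q_t(x) = \nabla \cdot (v^b_t(x) q_t(x)) \tag{14}$$
where
$$v^b_t(x) = -\sigma_{T-t}^2 \nabla \log q_t(x) + f_{T-t}(x).$$
The density $q_t$ also solves the backward Fokker-Planck Equation:
$$\partial_t q_t(x) = \sigma_{T-t}^2 \Delta q_t(x) - \nabla \cdot (w^b_t(x) q_t(x)) \tag{16}$$
where
$$w^b_t(x) = 2\sigma_{T-t}^2 \nabla \log q_t(x) - f_{T-t}(x).$$
Proof. Noting $\dot p_t$ the time derivative of $p_t$, we immediately see that $\partial_t q_t(x) = -\dot p_{T-t}(x)$, and the rest is a mere verification.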
Of course, these two equations are the same, but they represent the time-evolution of the density of two different random processes. As explained before, the Transport version (14) represents the time-evolution of the density of the ODE system
$$y'(t) = -v^b_t(y(t)), \qquad y(0) \sim p_T \tag{18}$$
while the Fokker-Planck version (16) represents the time-evolution of the SDE system
$$dY_t = w^b_t(Y_t)\,dt + \sqrt{2\sigma_{T-t}^2}\,dB_t, \qquad Y_0 \sim p_T. \tag{19}$$
Both of these processes can be sampled using a range of ODE and SDE solvers, the simplest being the Euler scheme and the Euler-Maruyama scheme. However, this requires access to the functions $v^b_t$ and $w^b_t$, which in turn depend on the unknown score $\nabla \log q_t = \nabla \log p_{T-t}$. Fortunately, $\nabla \log p_t$ can efficiently be estimated due to two factors.
First: we have samples from $p_t$. Remember that our only information about $p$ is a collection $x^1, \dots, x^n$ of samples. But thanks to the representation (5), the variables $x^i_t = e^{-\mu_t}x^i + \bar\sigma_t \xi^i$ with $\xi^i \sim N(0, I)$ are samples from $p_t$. They are extremely easy to access, since we only need to generate iid standard Gaussian variables $\xi^i$.
Second: score matching. If $p$ is a probability density and $x^i$ are samples from $p$, estimating $\nabla \log p$ (called the score) has been thoroughly examined and is fairly doable, using a technique known as score matching.
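The L2-distance between the scores of two probability densities $\rho_1, \rho_2$ is often called the Fisher divergence:
$$\mathrm{fisher}(\rho_1 \mid \rho_2) = \int \rho_1(x)\,|\nabla \log \rho_1(x) - \nabla \log \rho_2(x)|^2\,dx = \mathbb{E}_{X \sim \rho_1}\left[|\nabla \log \rho_1(X) - \nabla \log \rho_2(X)|^2\right].$$
Since our goal is to learn $\nabla \log p_t$, it is natural to choose a parametrized family of functions $s_\theta$ and to optimize $\theta$ so that the divergence
$$\int p_t(x)\,|\nabla \log p_t(x) - s_\theta(x)|^2\,dx$$
is as small as possible. However, this optimization problem is intractable, due to the presence of the explicit form of $p_t$ inside the integral. This is where Score Matching techniques come into play.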
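Let $p$ be a smooth probability density function supported over $\mathbb{R}^d$ and let $X$ be a random variable with density $p$. The following elementary identity is due to Hyvärinen, 2005; it is the basis for score matching estimation in statistics.
Let $s : \mathbb{R}^d \to \mathbb{R}^d$ be any smooth function with sufficiently fast decay at $\infty$. Then,
$$\mathbb{E}\left[|\nabla \log p(X) - s(X)|^2\right] = c + \mathbb{E}\left[|s(X)|^2 + 2\nabla \cdot s(X)\right] \tag{22}$$
where $c$ is a constant not depending on $s$.
Proof. We start by expanding the square norm:
$$\int p(x)\,|\nabla \log p(x) - s(x)|^2\,dx = \int p(x)\,|\nabla \log p(x)|^2\,dx + \int p(x)\,|s(x)|^2\,dx - 2\int p(x)\,\nabla \log p(x) \cdot s(x)\,dx.$$
The first term does not depend on $s$, it is our constant $c$. For the last term, we use $\nabla \log p = \nabla p / p$, then we use the integration-by-parts formula:
$$-2\int \nabla p(x) \cdot s(x)\,dx = 2\int p(x)\,\nabla \cdot s(x)\,dx = 2\,\mathbb{E}[\nabla \cdot s(X)]$$
and the identity is proved.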
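The identity can be sanity-checked by Monte Carlo in a case where the score is explicit. This is only an illustrative sketch: it assumes $p = N(0,1)$ in dimension one (so $\nabla \log p(x) = -x$ and $c = \mathbb{E}[X^2]$) and an arbitrary bounded candidate $s(x) = -\tanh(x)$ chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(400_000)  # X ~ p = N(0, 1), so grad log p(x) = -x

def s(x):
    """Candidate score model (arbitrary choice for this check)."""
    return -np.tanh(x)

def div_s(x):
    """Derivative of s, which is its divergence in dimension one."""
    return -(1.0 - np.tanh(x) ** 2)

lhs = np.mean((-x - s(x)) ** 2)                  # E|grad log p(X) - s(X)|^2
c = np.mean(x ** 2)                              # E|grad log p(X)|^2, independent of s
rhs = c + np.mean(s(x) ** 2 + 2.0 * div_s(x))    # c + E[|s|^2 + 2 div s]
print(lhs, rhs)
```

The right-hand side never evaluates the score of $p$ beyond the constant $c$, which is what makes the identity usable when only samples are available.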
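Now, (22) is particularly interesting for us. Remember that if we want to reverse (11), we do not really need to estimate $p_t$ but only $\nabla \log p_t$. We do so by approximating it using a parametrized family of functions, say $s_\theta$ (typically, a neural network):
$$\theta_t \in \arg\min_\theta \mathbb{E}\left[|\nabla \log p_t(X_t) - s_\theta(X_t)|^2\right] = \arg\min_\theta \mathbb{E}\left[|s_\theta(X_t)|^2 + 2\nabla \cdot s_\theta(X_t)\right].$$
First, we need not solve this optimization problem for every $t$. We could obviously discretize $[0,T]$ with $t_0 = 0 < t_1 < \dots < t_N = T$ and only solve for $\theta_{t_i}$ independently, but it is actually smarter and cheaper to approximate the whole function $(t,x) \mapsto \nabla \log p_t(x)$ by a single neural network (a U-net, in general). That is, we use a parametrized family $s_\theta(t,x)$. This enforces a form of time-continuity which seems natural. Now, since we want to aggregate the losses at each time, we solve the following problem:
$$\arg\min_\theta \int_0^T w(t)\,\mathbb{E}\left[|s_\theta(t, X_t)|^2 + 2\nabla \cdot s_\theta(t, X_t)\right]dt$$
where $w(t)$ is a weighting function (for example, $w(t)$ can be higher for $t \approx 0$, since we don't really care about approximating $p_T$ as precisely as $p_0$).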
In the preceding formulation we cannot exactly compute the expectation with respect to $p_t$, but we can approximate it with our samples $x^i_t$. Additionally, we need to approximate the integral, for instance we can discretize the time steps with $t_0 = 0 < t_1 < \dots < t_N = T$. Our objective function becomes
$$\ell(\theta) = \frac{1}{n}\sum_{t \in \{t_0, \dots, t_N\}} w(t) \sum_{i=1}^n \left(|s_\theta(t, x^i_t)|^2 + 2\nabla \cdot s_\theta(t, x^i_t)\right)$$
which looks computable… except it's not ideal. Suppose we perform a gradient descent on $\theta$ to find the optimal $\theta$ for time $t$. Then at each gradient descent step, we need to evaluate $s_\theta$ as well as its divergence; and then compute the gradient in $\theta$ of the divergence in $x$, in other words to compute $\nabla_\theta \nabla_x \cdot s_\theta$. In high dimension, this can be too costly.
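Fortunately, there is another way to perform score matching when $p_t$ is the distribution of a random variable with Gaussian noise added, as in our setting. We'll present this result in a fairly abstract setting; we suppose that $p$ is a density function, and $p_g = p * g$ where $g$ is another density. The following result is due to Vincent, 2010.
Denoising Score Matching Objective
Let $s : \mathbb{R}^d \to \mathbb{R}^d$ be a smooth function. Let $X$ be a random variable with density $p$, $\varepsilon$ an independent random variable with density $g$, and $X_\varepsilon = X + \varepsilon$, whose density is $p_g = p * g$. Then,
$$\mathbb{E}\left[|\nabla \log p_g(X_\varepsilon) - s(X_\varepsilon)|^2\right] = c + \mathbb{E}\left[|\nabla \log g(\varepsilon) - s(X_\varepsilon)|^2\right] \tag{26}$$
where $c$ is a constant not depending on $s$.
Proof. By the same computation as for vanilla score matching, we have
$$\mathbb{E}\left[|\nabla \log p_g(X_\varepsilon) - s(X_\varepsilon)|^2\right] = c + \int p_g(x)\,|s(x)|^2\,dx - 2\int \nabla p_g(x) \cdot s(x)\,dx.$$
Now by definition, $p_g(x) = \int p(y)g(x-y)\,dy$, hence $\nabla p_g(x) = \int p(y)\nabla g(x-y)\,dy$, and the last term above is equal to
$$-2\iint p(y)\,\nabla g(x-y) \cdot s(x)\,dy\,dx = -2\iint p(y)g(x-y)\,\nabla \log g(x-y) \cdot s(x)\,dy\,dx.$$
This last term is equal to $-2\,\mathbb{E}[\nabla \log g(\varepsilon) \cdot s(X + \varepsilon)]$. But then, upon adding and subtracting the term $\mathbb{E}[|\nabla \log g(\varepsilon)|^2]$ which does not depend on $s$, we get another constant $c'$ such that
$$\mathbb{E}\left[|\nabla \log p_g(X_\varepsilon) - s(X_\varepsilon)|^2\right] = c' + \mathbb{E}\left[|\nabla \log g(\varepsilon) - s(X + \varepsilon)|^2\right].$$
Now, this Denoising Score Matching loss does not involve any computation of a « double gradient » like $\nabla_\theta \nabla_x \cdot s_\theta$.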
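To see the denoising objective at work, one can fit a very small model and compare it with a score known in closed form. The sketch below assumes a one-dimensional toy case, $X \sim N(0,1)$ and Gaussian noise of standard deviation `noise_std`, so that $X_\varepsilon \sim N(0, 1 + \texttt{noise\_std}^2)$ and the true score is linear; the linear model $s(x) = a x$ and its least-squares fit are illustrative choices, not the training procedure used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, noise_std = 200_000, 0.5
x = rng.standard_normal(n)                  # X ~ p = N(0, 1)
eps = noise_std * rng.standard_normal(n)    # eps ~ g = N(0, noise_std^2)
x_eps = x + eps                             # X_eps ~ p * g

# Denoising target: grad log g(eps) = -eps / noise_std^2
target = -eps / noise_std**2

# Minimise E|target - a * X_eps|^2 over the slope a (closed-form least squares)
a_hat = np.sum(target * x_eps) / np.sum(x_eps**2)

# Slope of the true score of N(0, 1 + noise_std^2)
a_true = -1.0 / (1.0 + noise_std**2)
print(a_hat, a_true)
```

The regression only ever sees the noise and the noisy sample, never the density of $X_\varepsilon$, and still recovers the slope of its score.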
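Let us apply this to our setting. Remember that $p_t$ is the density of $e^{-\mu_t}X_0 + \varepsilon_t$ where $\varepsilon_t \sim N(0, \bar\sigma_t^2 I)$, hence in this case $g(x) = (2\pi\bar\sigma_t^2)^{-d/2}e^{-|x|^2/2\bar\sigma_t^2}$ and $\nabla \log g(x) = -x/\bar\sigma_t^2$. The objective in (26) becomes
$$\mathbb{E}\left[\left|\frac{\varepsilon_t}{\bar\sigma_t^2} + s(t, e^{-\mu_t}X_0 + \varepsilon_t)\right|^2\right].$$
This can be further simplified. Indeed, let us slightly change the parametrization and use $r(t,x) = -\bar\sigma_t s(t,x)$. Then, writing $\varepsilon_t = \bar\sigma_t \xi$ with $\xi \sim N(0, I)$,
$$\mathbb{E}\left[\left|\frac{\varepsilon_t}{\bar\sigma_t^2} + s(t, X_t)\right|^2\right] = \frac{1}{\bar\sigma_t^2}\,\mathbb{E}\left[\left|\xi - r(t, e^{-\mu_t}X_0 + \bar\sigma_t \xi)\right|^2\right].$$
Intuitively, the neural network $r$ tries to guess the scaled noise $\xi$ from the observation of $X_t$.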
Let us wrap everything up in this section.
The Denoising Diffusion Score Matching loss
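Let $\tau$ be a random time on $[0,T]$ with density proportional to $w(t)$; let $\xi$ be a standard Gaussian random variable. With the reparametrized network $r_\theta(t,x) = -\bar\sigma_t s_\theta(t,x)$, the DDPM theoretical objective is
$$\ell(\theta) = \mathbb{E}\left[\frac{1}{\bar\sigma_\tau^2}\left|\xi - r_\theta(\tau, e^{-\mu_\tau}X_0 + \bar\sigma_\tau \xi)\right|^2\right].$$
Since we have access to samples $(x^i, \xi^i, \tau^i)$ (at the cost of generating iid samples $\xi^i$ from a standard Gaussian and $\tau^i$ from the density of $\tau$, which is uniform over $[0,T]$ when $w$ is constant), we get the empirical version:
$$\hat\ell(\theta) = \frac{1}{n}\sum_{i=1}^n \frac{1}{\bar\sigma_{\tau^i}^2}\left|\xi^i - r_\theta(\tau^i, e^{-\mu_{\tau^i}}x^i + \bar\sigma_{\tau^i}\xi^i)\right|^2.$$
Up to the constants and the choice of the drift $\alpha_t$ and variance $\sigma_t$, this is exactly the loss function (14) from the paper DDPM, for instance.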
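The empirical loss is simple to evaluate once $\mu_t$ and $\bar\sigma_t$ are available. Here is a hedged sketch of such an evaluation: it assumes the normalisation $|\xi - r(\tau, x_\tau)|^2 / \bar\sigma_\tau^2$, takes the noise predictor `r` as a plain Python callable instead of a neural network, and uses a toy case (pure Ornstein-Uhlenbeck with $\sigma_t = 1$ and data $N(0, I)$) in which the best predictor is known to be $r(t,x) = \bar\sigma_t x$. All names are hypothetical.

```python
import numpy as np

def ddpm_loss(r, x0, tau, xi, mu, bar_sigma):
    """Mean over samples of |xi - r(tau, x_tau)|^2 / bar_sigma(tau)^2."""
    scale = np.exp(-mu(tau))[:, None]
    std = bar_sigma(tau)[:, None]
    x_tau = scale * x0 + std * xi                 # noisy sample at time tau
    err = np.sum((xi - r(tau, x_tau)) ** 2, axis=1)
    return np.mean(err / std[:, 0] ** 2)

def mu(t):
    return t                                      # alpha_t = 1

def bar_sigma(t):
    return np.sqrt(1.0 - np.exp(-2.0 * t))        # sigma_t = 1

def r_opt(t, x):
    """Best noise predictor when X_0 ~ N(0, I): E[xi | X_t] = bar_sigma_t * X_t."""
    return bar_sigma(t)[:, None] * x

def r_zero(t, x):
    """Trivial predictor that always answers zero."""
    return np.zeros_like(x)

rng = np.random.default_rng(0)
n, d = 20_000, 2
x0 = rng.standard_normal((n, d))
xi = rng.standard_normal((n, d))
tau = rng.uniform(0.1, 2.0, size=n)               # kept away from 0, where 1/bar_sigma^2 blows up

loss_opt = ddpm_loss(r_opt, x0, tau, xi, mu, bar_sigma)
loss_zero = ddpm_loss(r_zero, x0, tau, xi, mu, bar_sigma)
print(loss_opt, loss_zero)
```

In an actual training loop `r` would be the network and the mean would be taken over a minibatch; the lower bound on `tau` is a choice made here to keep the weights finite.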
In practice, for image generation, the go-to choice for the architecture of $s_\theta$ is a U-net, a special kind of convolutional neural network with a downsampling phase, an upsampling phase, and skip-connections in between.
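Once the algorithm has converged to $\theta_\star$, we get $s_{\theta_\star}(t,x)$ which is a proxy for $\nabla \log p_t(x)$. Now, we simply plug this expression in the function $v^b_t$ if we want to solve the ODE (18) or in $w^b_t$ if we want to solve the SDE (19), that is, we replace $\nabla \log q_t(x) = \nabla \log p_{T-t}(x)$ by $s_{\theta_\star}(T-t, x)$ and the unknown initial density $p_T$ by $\pi$.
The ODE sampler
$$y'(t) = -v^\theta_t(y(t)), \qquad v^\theta_t(x) = -\sigma_{T-t}^2\, s_{\theta_\star}(T-t, x) + f_{T-t}(x), \qquad y(0) \sim \pi.$$
The SDE sampler
$$dY_t = w^\theta_t(Y_t)\,dt + \sqrt{2\sigma_{T-t}^2}\,dB_t, \qquad w^\theta_t(x) = 2\sigma_{T-t}^2\, s_{\theta_\star}(T-t, x) - f_{T-t}(x), \qquad Y_0 \sim \pi.$$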
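Both samplers can be sketched with the simplest solvers, the Euler scheme for the ODE and the Euler-Maruyama scheme for the SDE. The code below is an illustration under assumptions: the backward ODE is taken as $y' = \sigma_{T-t}^2\, s(T-t, y) - f_{T-t}(y)$ and the backward SDE drift as $2\sigma_{T-t}^2\, s(T-t, y) - f_{T-t}(y)$, the learnt network is replaced by the exact score of a Gaussian toy problem, and the function names are hypothetical.

```python
import numpy as np

def ode_sampler(score, f, sigma, T, n_steps, y0):
    """Euler scheme for y' = sigma^2 * score - f, run in backward time."""
    dt = T / n_steps
    y = np.array(y0, dtype=float)
    for k in range(n_steps):
        s = T - k * dt                                   # forward time
        y = y + (sigma(s) ** 2 * score(s, y) - f(s, y)) * dt
    return y

def sde_sampler(score, f, sigma, T, n_steps, y0, rng):
    """Euler-Maruyama for dY = (2 sigma^2 score - f) dt + sqrt(2 sigma^2) dB."""
    dt = T / n_steps
    y = np.array(y0, dtype=float)
    for k in range(n_steps):
        s = T - k * dt
        drift = 2.0 * sigma(s) ** 2 * score(s, y) - f(s, y)
        noise = rng.standard_normal(y.shape)
        y = y + drift * dt + np.sqrt(2.0 * sigma(s) ** 2 * dt) * noise
    return y

# Toy problem: data N(3, 0.5^2), pure OU dynamics (f_t(x) = -x, sigma_t = 1).
M0, S0 = 3.0, 0.5

def f(t, x):
    return -x

def sigma(t):
    return 1.0

def score(t, x):
    """Exact score of p_t, standing in for a trained network."""
    mean = M0 * np.exp(-t)
    var = S0**2 * np.exp(-2.0 * t) + 1.0 - np.exp(-2.0 * t)
    return -(x - mean) / var

rng = np.random.default_rng(0)
y0 = rng.standard_normal(20_000)                         # pi = N(0, 1) replaces p_T
y_ode = ode_sampler(score, f, sigma, T=3.0, n_steps=2000, y0=y0)
y_sde = sde_sampler(score, f, sigma, T=3.0, n_steps=2000, y0=y0, rng=rng)
print(y_ode.mean(), y_ode.var(), y_sde.mean(), y_sde.var())
```

Both outputs should be close to the data distribution $N(3, 0.25)$; the small residual error comes from starting at $N(0,1)$ instead of $p_T$ and from the time discretization.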
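We must stress a subtle fact. Equations (8) and (10), or their backward counterparts, are exactly the same equation accounting for $p_t$. But since now we replaced $\nabla \log p_t$ by its approximation $s_\theta$, this is no longer the case for our two samplers: their probability densities are not the same. In fact, let us note $q^{\mathrm{ode}}_t, q^{\mathrm{sde}}_t$ the densities of $y(t)$ and $Y_t$; the first one solves a Transport Equation, the second one a Fokker-Planck equation, and these two equations are different.
Backward Equations for the samplers
$$\partial_t q^{\mathrm{ode}}_t(x) = \nabla \cdot (v^\theta_t(x)\, q^{\mathrm{ode}}_t(x)), \qquad \partial_t q^{\mathrm{sde}}_t(x) = \nabla \cdot (u^\theta_t(x)\, q^{\mathrm{sde}}_t(x)) \tag{37}$$
where $v^\theta_t(x) = -\sigma_{T-t}^2\, s_\theta(T-t, x) + f_{T-t}(x)$ and $u^\theta_t(x) = \sigma_{T-t}^2 \nabla \log q^{\mathrm{sde}}_t(x) - 2\sigma_{T-t}^2\, s_\theta(T-t, x) + f_{T-t}(x)$.
Importantly, the velocity $u^\theta_t$ is in general not equal to the velocity $v^\theta_t$. They would be equal only in the case $s_\theta(T-t, x) = \nabla \log q^{\mathrm{sde}}_t(x)$.
Proof. Since the ODE sampler is an ODE, its density directly satisfies the transport equation with velocity $v^\theta_t$.
Since the SDE sampler is an SDE, its density satisfies the Fokker-Planck equation associated with the drift $w^\theta_t(x) = 2\sigma_{T-t}^2\, s_\theta(T-t, x) - f_{T-t}(x)$,
which in turn can be transformed in the transport equation shown above.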
Considerable work has been done (mostly experimentally) to find good functions $\alpha_t, \sigma_t$. Some choices seem to stand out.
the Variance Exploding path takes $\alpha_t = 0$ (that is, no drift) and $\sigma_t$ a continuous, increasing function over , such that and ; typically, .
the Variance-Preserving path takes $\sigma_t = \sqrt{\alpha_t}$.
the pure Ornstein-Uhlenbeck path takes $\alpha_t = \sigma_t = 1$; it is a special case of the previous one, mostly suitable for theoretical purposes.
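For a given pair of schedules, the two quantities that drive the forward process, $\mu_t = \int_0^t \alpha_s\,ds$ and $\bar\sigma_t^2 = 2\int_0^t e^{2(\mu_s - \mu_t)}\sigma_s^2\,ds$, can be computed by numerical quadrature. The sketch below assumes that "variance preserving" means $\sigma_t^2 = \alpha_t$, in which case one expects $\bar\sigma_t^2 = 1 - e^{-2\mu_t}$; the linear schedule $\alpha_t = 0.1 + 2t$ is a made-up example.

```python
import numpy as np

def schedule_stats(alpha, sigma, t, n_grid=20_001):
    """Return (mu_t, bar_sigma_t^2) using the trapezoidal rule on a uniform grid."""
    s = np.linspace(0.0, t, n_grid)
    ds = s[1] - s[0]
    a = alpha(s)
    # mu_s = int_0^s alpha_u du, accumulated along the grid
    mu = np.concatenate([[0.0], np.cumsum(0.5 * (a[1:] + a[:-1]) * ds)])
    integrand = 2.0 * np.exp(2.0 * (mu - mu[-1])) * sigma(s) ** 2
    bar_sigma2 = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * ds)
    return mu[-1], bar_sigma2

# Variance-preserving schedule (assumed form): sigma_t^2 = alpha_t
def alpha_vp(s):
    return 0.1 + 2.0 * s

def sigma_vp(s):
    return np.sqrt(alpha_vp(s))

# Pure Ornstein-Uhlenbeck schedule: alpha_t = sigma_t = 1
def alpha_ou(s):
    return np.ones_like(s)

def sigma_ou(s):
    return np.ones_like(s)

mu_vp, var_vp = schedule_stats(alpha_vp, sigma_vp, t=1.0)
mu_ou, var_ou = schedule_stats(alpha_ou, sigma_ou, t=1.0)
print(mu_vp, var_vp, 1.0 - np.exp(-2.0 * mu_vp))
print(mu_ou, var_ou, 1.0 - np.exp(-2.0))
```

With unit-variance data, the total variance $e^{-2\mu_t} + \bar\sigma_t^2$ then stays equal to 1 along the path, which is where the name comes from.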
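Let $s : [0,T] \times \mathbb{R}^d \to \mathbb{R}^d$ be a smooth function, meant as a proxy for $\nabla \log p_t$. Our goal is to quantify the difference between the sampled densities $q^{\mathrm{ode}}_T, q^{\mathrm{sde}}_T$ and $p$. It turns out that controlling the Fisher divergence results in a bound for $\mathrm{kl}(p \mid q^{\mathrm{sde}}_T)$, but not for $\mathrm{kl}(p \mid q^{\mathrm{ode}}_T)$.
The true density is $q_t = p_{T-t}$, it satisfies the backward equation (14):
$$\partial_t q_t(x) = \nabla \cdot (v^b_t(x) q_t(x)), \qquad v^b_t(x) = -\sigma_{T-t}^2 \nabla \log q_t(x) + f_{T-t}(x).$$
The density of the generative process is $q^{\mathrm{sde}}_t$, but we'll simply note $\hat q_t$. It satisfies the backward equation (37)
$$\partial_t \hat q_t(x) = \nabla \cdot (u_t(x) \hat q_t(x)), \qquad u_t(x) = \sigma_{T-t}^2 \nabla \log \hat q_t(x) - 2\sigma_{T-t}^2\, s(T-t, x) + f_{T-t}(x).$$
The original distribution we want to sample is $p = p_0 = q_T$, and the output distribution of our SDE sampler is $\hat q_T$. Finally, the distribution $p_T = q_0$ is approximated with $\hat q_0 = \pi$ (in practice, $N(0, I)$).
The KL divergence between densities $\rho_1, \rho_2$ is
$$\mathrm{kl}(\rho_1 \mid \rho_2) = \int \rho_1(x) \log \frac{\rho_1(x)}{\rho_2(x)}\,dx.$$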
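This theorem restricts to the case where the weights $w(t)$ are constant, and for simplicity, they are set to 1.
Variational lower-bound for score-based diffusion models with SDE sampler
$$\mathrm{kl}(p \mid \hat q_T) \le \mathrm{kl}(p_T \mid \pi) + \int_0^T \sigma_t^2\,\mathbb{E}\left[|\nabla \log p_t(X_t) - s(t, X_t)|^2\right]dt. \tag{42}$$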
The original proof can be found in this paper and uses the Girsanov theorem applied to the SDE representations (1)-(2) of the forward/backward process. This is utterly complicated and is too dependent on the SDE representation. The proof presented below only needs the Fokker-Planck equation and is done directly at the level of probability densities.
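The following lemma is interesting on its own since it gives an exact expression for the KL divergence between transport equations.
Let $\rho^1_t, \rho^2_t$ be two densities solving the transport equations $\partial_t \rho^1_t = \nabla \cdot (u^1_t \rho^1_t)$ and $\partial_t \rho^2_t = \nabla \cdot (u^2_t \rho^2_t)$. Then
$$\frac{d}{dt}\mathrm{kl}(\rho^1_t \mid \rho^2_t) = \int \rho^1_t(x)\,\nabla \log \frac{\rho^1_t(x)}{\rho^2_t(x)} \cdot \left(u^2_t(x) - u^1_t(x)\right)dx. \tag{43}$$
In our case with the specific shape assumed by $v^b_t$ and $u_t$, we get the following bound:
$$\frac{d}{dt}\mathrm{kl}(q_t \mid \hat q_t) \le \sigma_{T-t}^2 \int q_t(x)\,|\nabla \log q_t(x) - s(T-t, x)|^2\,dx. \tag{44}$$
The proofs of (42)-(43)-(44) are only based on elementary manipulations of time-evolution equations.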
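Proof of (43).
A small differentiation shows that $\frac{d}{dt}\mathrm{kl}(\rho^1_t \mid \rho^2_t)$ is equal to
$$\int \nabla \cdot (u^1_t \rho^1_t) \log \frac{\rho^1_t}{\rho^2_t}\,dx + \int \rho^1_t\,\frac{\nabla \cdot (u^1_t \rho^1_t)}{\rho^1_t}\,dx - \int \rho^1_t\,\frac{\nabla \cdot (u^2_t \rho^2_t)}{\rho^2_t}\,dx.$$
By an integration by parts, the first term is also equal to $-\int \rho^1_t\,\nabla \log(\rho^1_t/\rho^2_t) \cdot u^1_t\,dx$. For the second term, it is clearly zero. Finally, for the last one,
$$-\int \frac{\rho^1_t}{\rho^2_t}\,\nabla \cdot (u^2_t \rho^2_t)\,dx = \int \nabla\left(\frac{\rho^1_t}{\rho^2_t}\right) \cdot u^2_t \rho^2_t\,dx = \int \rho^1_t\,\nabla \log \frac{\rho^1_t}{\rho^2_t} \cdot u^2_t\,dx.$$
Proof of (44). We recall that
$$v^b_t(x) = -\sigma_{T-t}^2 \nabla \log q_t(x) + f_{T-t}(x), \qquad u_t(x) = \sigma_{T-t}^2 \nabla \log \hat q_t(x) - 2\sigma_{T-t}^2\, s(T-t, x) + f_{T-t}(x),$$
so that
$$u_t(x) - v^b_t(x) = \sigma_{T-t}^2\left(\nabla \log q_t(x) + \nabla \log \hat q_t(x) - 2 s(T-t, x)\right).$$
We momentarily note $a = \nabla \log q_t(x)$ and $b = \nabla \log \hat q_t(x)$ and $s = s(T-t, x)$. Then, (43) shows that
$$\frac{d}{dt}\mathrm{kl}(q_t \mid \hat q_t) = \sigma_{T-t}^2 \int q_t(x)\,(a - b) \cdot (a + b - 2s)\,dx = \sigma_{T-t}^2 \int q_t(x)\left(2(a - b) \cdot (a - s) - |a - b|^2\right)dx.$$
We now use the classical inequality $2x \cdot y \le |x|^2 + |y|^2$; we get
$$\frac{d}{dt}\mathrm{kl}(q_t \mid \hat q_t) \le \sigma_{T-t}^2 \int q_t(x)\,|a - s|^2\,dx = \sigma_{T-t}^2 \int q_t(x)\,|\nabla \log q_t(x) - s(T-t, x)|^2\,dx.$$
Proof of (42).
Now, we simply write
$$\mathrm{kl}(q_T \mid \hat q_T) - \mathrm{kl}(q_0 \mid \hat q_0) = \int_0^T \frac{d}{dt}\mathrm{kl}(q_t \mid \hat q_t)\,dt \tag{52}$$
and plug (44) inside the RHS. Here $q_T = p$, $q_0 = p_T$ and $\hat q_0 = \pi$, hence the result (after the change of variable $t \to T - t$ in the integral).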
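In a Gaussian toy case every term of such a bound is explicit, which gives a cheap numerical check. The sketch below assumes the bound has the form $\mathrm{kl}(p \mid \hat q_T) \le \mathrm{kl}(p_T \mid \pi) + \int_0^T \sigma_t^2\,\mathbb{E}|\nabla \log p_t(X_t) - s(t, X_t)|^2\,dt$, and works in dimension one with $f_t(x) = -x$, $\sigma_t = 1$, data $N(0, 4)$ and a deliberately distorted score $s(t,x) = -C x / \mathrm{Var}(X_t)$. With a linear score the SDE sampler stays Gaussian, so only its variance needs to be integrated. The constants are arbitrary.

```python
import numpy as np

S0_SQ, C, T, N = 4.0, 1.3, 2.0, 20_000  # data variance, score distortion, horizon, grid size

def fwd_var(t):
    """Variance of X_t for dX = -X dt + sqrt(2) dB started from N(0, S0_SQ)."""
    return S0_SQ * np.exp(-2.0 * t) + 1.0 - np.exp(-2.0 * t)

def kl_gauss(a, b):
    """kl(N(0, a) | N(0, b)) in dimension one."""
    return 0.5 * (a / b - 1.0 - np.log(a / b))

dt = T / N

# Variance V_t of the SDE sampler dY = (2 s(T-t, Y) + Y) dt + sqrt(2) dB, Y_0 ~ N(0, 1):
# V' = 2 (1 - 2 C / fwd_var(T - t)) V + 2, integrated with the Euler scheme.
V = 1.0
for k in range(N):
    V += (2.0 * (1.0 - 2.0 * C / fwd_var(T - k * dt)) * V + 2.0) * dt

# int_0^T E|grad log p_t(X_t) - s(t, X_t)|^2 dt = int_0^T (C - 1)^2 / fwd_var(t) dt (midpoint rule)
t_mid = (np.arange(N) + 0.5) * dt
score_err = np.sum((C - 1.0) ** 2 / fwd_var(t_mid)) * dt

lhs = kl_gauss(S0_SQ, V)                       # kl(p | output of the SDE sampler)
rhs = kl_gauss(fwd_var(T), 1.0) + score_err    # kl(p_T | pi) + integrated score error
print(lhs, rhs)
```

Setting `C = 1.0` recovers the exact score, in which case the only remaining error comes from starting at $\pi$ instead of $p_T$.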
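It turns out that the ODE solver, whose density is $q^{\mathrm{ode}}_t$, does not have such a nice upper bound. In fact, since $q^{\mathrm{ode}}_t$ solves a Transport Equation, we can still use (43) but with $u_t$ replaced with $v^\theta_t(x) = -\sigma_{T-t}^2\, s(T-t, x) + f_{T-t}(x)$, and integrate in $t$ just as in (52). We have
$$\mathrm{kl}(p \mid q^{\mathrm{ode}}_T) = \mathrm{kl}(p_T \mid \pi) + \int_0^T \sigma_{T-t}^2 \int q_t(x)\,\nabla \log \frac{q_t(x)}{q^{\mathrm{ode}}_t(x)} \cdot \left(\nabla \log q_t(x) - s(T-t, x)\right)dx\,dt.$$
Using the Cauchy-Schwarz inequality, we could obtain the following upper bound.
$$\mathrm{kl}(p \mid q^{\mathrm{ode}}_T) \le \mathrm{kl}(p_T \mid \pi) + \int_0^T \sigma_{T-t}^2 \left(\int q_t(x)\,|\nabla \log q_t(x) - s(T-t, x)|^2\,dx\right)^{1/2}\left(\int q_t(x)\,|\nabla \log q_t(x) - \nabla \log q^{\mathrm{ode}}_t(x)|^2\,dx\right)^{1/2}dt.$$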
There is a significant difference between the bound obtained for the ODE sampler and the SDE version: minimizing the score matching objective function does not minimize the ODE upper bound, whereas it does minimize the SDE one. This disparity is due to the Fisher divergence, which does not provide control over the KL divergence between the solutions of two transport equations. However, it does control the KL divergence between the solutions of the associated Fokker-Planck equations, thanks to the presence of a diffusive term. This could be one of the reasons for the lower performance of ODE solvers that was observed by early experimenters in the field. However, more recent works (see the references just below) seem to challenge this idea. With dynamics other than the Ornstein-Uhlenbeck one, deterministic sampling techniques like ODEs now seem to outperform the stochastic ones. A complete understanding of these phenomena is not available yet; the outstanding paper on stochastic interpolants proposes a remarkable framework towards this task (and inspired most of the analysis in this note).
The original paper on diffusion models
DDPM (seminal paper for image generation)
Diffusion beat GANs (pushing diffusions well beyond the SOTA)
Variational perspective on Diffusions or arxiv (the analytical SDE approach)
Maximum likelihood training of Diffusions (proofs of the variational lower-bound)
Sampling is as easy as learning the score (theoretical analysis under minimal assumptions)
Diffusion Schrodinger Bridge
Probability flow for FP
Flow matching paper