Let p be any smooth probability density function. Its score is the gradient of its log-density, ∇log p(x). This object is of paramount importance in many fields, such as physics, statistics, and machine learning. In particular, it is what allows us to sample from p using Langevin dynamics, and we have seen that it is also the key ingredient when reversing diffusion processes. In this note, we survey the classical technique for learning the score of a density from its samples: score matching.
Since our goal is to learn ∇logp(x), it is natural to choose a parametrized family of functions sθ and to optimize θ so that the divergence
$$\int p(x)\,\|\nabla\log p(x)-s_\theta(x)\|^2\,dx$$
is as small as possible. However, this optimization problem is intractable as written, since the integrand involves ∇log p, which is precisely the unknown quantity we are trying to estimate. This is where Score Matching techniques come into play.
The score matching trick
Let p be a smooth probability density function supported on $\mathbb{R}^d$ and let X be a random variable with density p. The following elementary identity is due to Hyvärinen, 2005; it is the basis for score matching estimation in statistics.
Let $s:\mathbb{R}^d\to\mathbb{R}^d$ be any smooth function with sufficiently fast decay at infinity, and let X∼p. Then
$$\mathbb{E}\big[\|\nabla\log p(X)-s(X)\|^2\big]=c+\mathbb{E}\big[\|s(X)\|^2+2\,\nabla\cdot s(X)\big]$$
where c is a constant not depending on s.
Proof. We start by expanding the squared norm:
$$\int p(x)\,\|\nabla\log p(x)-s(x)\|^2\,dx=\int p(x)\,\|\nabla\log p(x)\|^2\,dx+\int p(x)\,\|s(x)\|^2\,dx-2\int \nabla\log p(x)\cdot s(x)\,p(x)\,dx.$$
The first term does not depend on s: it is our constant c. For the last term, we use ∇log p = ∇p/p and then the integration-by-parts formula:
$$-2\int \nabla p(x)\cdot s(x)\,dx = 2\int p(x)\,\nabla\cdot s(x)\,dx,$$
the boundary term vanishing thanks to the decay assumption on s.
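As a quick sanity check of this identity (a worked example of mine, not part of the original argument), take $d=1$, $p=\mathcal N(0,1)$, so that $\nabla\log p(x)=-x$, and a linear candidate $s(x)=ax$. The left-hand side is $\mathbb{E}[(-X-aX)^2]=(1+a)^2$. On the right-hand side, $\mathbb{E}[s(X)^2+2s'(X)]=a^2+2a$, and with $c=\mathbb{E}[(\nabla\log p(X))^2]=1$ we recover $(1+a)^2$ as well; both sides are minimized at $a=-1$, that is, when s matches the true score.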
Practically, learning the score of a density p from samples x1,…,xn is done by minimizing the empirical version of the right-hand side of (3):
$$\ell(\theta)=\frac{1}{n}\sum_{i=1}^{n}\Big[\|s_\theta(x_i)\|^2+2\,\nabla\cdot s_\theta(x_i)\Big].$$
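To make this concrete, here is a minimal PyTorch sketch of this empirical objective; the name score_net and its batched (n, d) → (n, d) signature are my assumptions, not from the text. The divergence is computed with one automatic-differentiation pass per coordinate.

```python
import torch

def score_matching_loss(score_net, x):
    """Empirical Hyvärinen objective: mean of |s(x_i)|^2 + 2 * div s(x_i).
    The divergence is computed coordinate by coordinate with autograd,
    i.e. one backward pass per dimension."""
    x = x.clone().requires_grad_(True)
    s = score_net(x)                          # scores, shape (n, d)
    sq_norm = (s ** 2).sum(dim=1)             # |s(x_i)|^2 for each sample

    div = torch.zeros(x.shape[0], device=x.device)
    for k in range(x.shape[1]):
        # derivative of the k-th output coordinate with respect to x_k
        grad_k = torch.autograd.grad(s[:, k].sum(), x, create_graph=True)[0]
        div = div + grad_k[:, k]

    return (sq_norm + 2.0 * div).mean()
```

The loop over coordinates already hints at the computational issue discussed below: the divergence costs d extra differentiation passes per training step.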
How do we empirically optimize (6) in the context of diffusion models?
In the context of diffusion models, we have a whole family of densities pt and we want to learn the score for every t∈[0,T].
First, we need not solve this optimization problem separately for every t. We could obviously discretize [0,T] with times t_1,…,t_N and solve for each θ_{t_i} independently, but it is actually smarter and cheaper to approximate the whole function (t,x) ↦ ∇log p_t(x) by a single neural network (a U-net, in general). That is, we use a parametrized family s_θ(t,x). This enforces a form of time-continuity, which seems natural. Now, since we want to aggregate the losses over time, we solve the following problem:
$$\min_\theta \int_0^T w(t)\,\mathbb{E}\Big[\|s_\theta(t,X_t)\|^2+2\,\nabla_x\cdot s_\theta(t,X_t)\Big]\,dt,$$
where X_t∼p_t and w(t) is a weighting function (for example, w(t) can be higher for t≈0, since we do not really care about approximating p_T as precisely as p_0).
In the preceding formulation we cannot exactly compute the expectation with respect to p_t, but we can approximate it with our samples x^i_{t}. Additionally, we need to approximate the time integral, for instance by discretizing [0,T] with time steps t_0=0<t_1<⋯<t_N=T. Our objective function becomes
$$\sum_{k=0}^{N} w(t_k)\,\frac{1}{n}\sum_{i=1}^{n}\Big[\|s_\theta(t_k,x^i_{t_k})\|^2+2\,\nabla_x\cdot s_\theta(t_k,x^i_{t_k})\Big],$$
which looks computable… except it is not ideal. Suppose we perform gradient descent on θ. At each gradient step, we need to evaluate s_θ as well as its divergence, and then compute the gradient with respect to θ of this divergence in x, in other words compute ∇_θ(∇_x⋅s_θ). In high dimension, this can be too costly.
Fortunately, there is another way to perform score matching when p_t is the distribution of a random variable to which Gaussian noise has been added, as in our setting. We will present this result in a fairly abstract setting: we suppose that p is a density function and that q = p∗g, where g is another density. The following result is due to Vincent, 2010.
Denoising Score Matching Objective
Let $s:\mathbb{R}^d\to\mathbb{R}^d$ be a smooth function. Let X be a random variable with density p, and let ε be an independent random variable with density g. We call $p_{\mathrm{noisy}}$ the distribution of $X_{\mathrm{noisy}}=X+\varepsilon$. Then
$$\mathbb{E}\big[\|\nabla\log p_{\mathrm{noisy}}(X+\varepsilon)-s(X+\varepsilon)\|^2\big]=c+\mathbb{E}\big[\|\nabla\log g(\varepsilon)-s(X+\varepsilon)\|^2\big]$$
where c is a constant not depending on s.
Proof. Note that $p_{\mathrm{noisy}}=p*g$. By expanding the square, we see that $\mathbb{E}\big[\|\nabla\log p_{\mathrm{noisy}}(X+\varepsilon)-s(X+\varepsilon)\|^2\big]$ is equal to
$$c+\int p_{\mathrm{noisy}}(x)\,\|s(x)\|^2\,dx-2\int\nabla p_{\mathrm{noisy}}(x)\cdot s(x)\,dx.$$
Now by definition, $p_{\mathrm{noisy}}(x)=\int p(y)\,g(x-y)\,dy$, hence $\nabla p_{\mathrm{noisy}}(x)=\int p(y)\,\nabla g(x-y)\,dy$, and the last term above is equal to
$$-2\iint p(y)\,\nabla g(x-y)\cdot s(x)\,dy\,dx=-2\iint p(y)\,g(x-y)\,\nabla\log g(x-y)\cdot s(x)\,dy\,dx=-2\,\mathbb{E}\big[\nabla\log g(\varepsilon)\cdot s(X+\varepsilon)\big].$$
But then, upon adding and subtracting the term $\mathbb{E}[\|\nabla\log g(\varepsilon)\|^2]$, which does not depend on s, we get another constant c′ such that
$$\mathbb{E}\big[\|\nabla\log p_{\mathrm{noisy}}(X+\varepsilon)-s(X+\varepsilon)\|^2\big]=c'+\mathbb{E}\big[\|\nabla\log g(\varepsilon)-s(X+\varepsilon)\|^2\big],$$
which is the claimed identity.
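To illustrate why this matters in practice, here is a minimal PyTorch sketch of the denoising score matching loss in the Gaussian case $g=\mathcal N(0,\sigma^2 I)$, which is the case relevant to our setting (score_net and sigma are placeholder names of mine): no divergence of the network is needed, only a regression onto ∇log g(ε) = −ε/σ².

```python
import torch

def denoising_score_matching_loss(score_net, x, sigma):
    """Denoising score matching with Gaussian noise g = N(0, sigma^2 I):
    regress s(x + eps) onto grad log g(eps) = -eps / sigma^2.
    No divergence of the network is needed."""
    eps = sigma * torch.randn_like(x)
    target = -eps / sigma ** 2
    return ((score_net(x + eps) - target) ** 2).sum(dim=1).mean()
```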
It turns out that the Denoising Score Matching objective is just an avatar of a deep, not-so-well-known result called Tweedie's formula. Herbert Robbins is often credited with the first discovery of this formula, in the context of exponential (Poisson) distributions; Maurice Tweedie extended it, and Bradley Efron popularized it in his excellent paper on selection bias.
Tweedie's formula
Let X be a random variable with density p, and let ε be an independent random variable with distribution $\mathcal N(0,\sigma^2 I)$. We still write $p_{\mathrm{noisy}}$ for the density of $X_{\mathrm{noisy}}=X+\varepsilon$. Then
$$\mathbb{E}[X\mid X_{\mathrm{noisy}}]=X_{\mathrm{noisy}}+\sigma^2\,\nabla\ln p_{\mathrm{noisy}}(X_{\mathrm{noisy}}).$$
Equivalently, $\mathbb{E}[\varepsilon\mid X_{\mathrm{noisy}}]=-\sigma^2\,\nabla\ln p_{\mathrm{noisy}}(X_{\mathrm{noisy}})$.
Proof. The joint distribution of $(X,X_{\mathrm{noisy}})$ has density $p(x)\,g(z-x)$, hence the conditional expectation of X given $X_{\mathrm{noisy}}=z$ is
$$\mathbb{E}[X\mid X_{\mathrm{noisy}}=z]=\frac{\int x\,p(x)\,g(z-x)\,dx}{\int p(x)\,g(z-x)\,dx}.$$
Since $\nabla g(z-x)=-\frac{z-x}{\sigma^2}\,g(z-x)$, we have
$$\sigma^2\,\nabla p_{\mathrm{noisy}}(z)=\sigma^2\int p(x)\,\nabla g(z-x)\,dx=\int (x-z)\,p(x)\,g(z-x)\,dx=\int x\,p(x)\,g(z-x)\,dx-z\,p_{\mathrm{noisy}}(z),$$
hence
$$\mathbb{E}[X\mid X_{\mathrm{noisy}}=z]=z+\sigma^2\,\frac{\nabla p_{\mathrm{noisy}}(z)}{p_{\mathrm{noisy}}(z)}=z+\sigma^2\,\nabla\ln p_{\mathrm{noisy}}(z),$$
which is the desired result. For the second formula, just note that $X_{\mathrm{noisy}}=\mathbb{E}[X_{\mathrm{noisy}}\mid X_{\mathrm{noisy}}]=\mathbb{E}[X\mid X_{\mathrm{noisy}}]+\mathbb{E}[\varepsilon\mid X_{\mathrm{noisy}}]$ and simplify.
Since $X_{\mathrm{noisy}}$ is centered around X, the classical ("frequentist") estimate of X given $X_{\mathrm{noisy}}$ is $X_{\mathrm{noisy}}$ itself. Tweedie's formula corrects this estimate by adding a term that accounts for the fact that X is itself random: for instance, if $X_{\mathrm{noisy}}$ lands in a region where X is extremely unlikely to live, then the noise is probably responsible for this, and the estimate of X is pulled back toward regions where p puts more mass.
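To make this concrete, here is a small worked example (mine, not from the original): take $X\sim\mathcal N(0,\tau^2)$ and $\varepsilon\sim\mathcal N(0,\sigma^2)$ in dimension one. Then $p_{\mathrm{noisy}}=\mathcal N(0,\tau^2+\sigma^2)$, so $\nabla\ln p_{\mathrm{noisy}}(z)=-z/(\tau^2+\sigma^2)$ and Tweedie's formula gives
$$\mathbb{E}[X\mid X_{\mathrm{noisy}}=z]=z-\frac{\sigma^2}{\tau^2+\sigma^2}\,z=\frac{\tau^2}{\tau^2+\sigma^2}\,z,$$
the familiar Bayesian shrinkage of the observation toward the prior mean.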
Now, what is the link between this and Denoising Score Matching? Well, the conditional expectation of X given Z minimizes the L²-distance, in the sense that $\mathbb{E}[X\mid Z]=f(Z)$ where f minimizes $\mathbb{E}[\|f(Z)-X\|^2]$. Formula (15) says that $\nabla\ln p_{\mathrm{noisy}}$, the quantity we need to estimate, is nothing but the « best denoiser », up to the scaling factor $-\sigma^2$:
$$\nabla\ln p_{\mathrm{noisy}}\in\operatorname*{arg\,min}_{f}\ \mathbb{E}\big[\|f(X+\varepsilon)-(-\varepsilon/\sigma^2)\|^2\big].$$
Replacing f with a neural network, we recover exactly the Denoising Score Matching objective in (10).
I find that this result gives some intuition about score matching and about how to parametrize the network.
If s(x) is a « noise predictor » trained with the objective $\mathbb{E}[\|s(X+\varepsilon)-\varepsilon\|^2]$, then $-s(x)/\sigma^2$ is a proxy for $\nabla\ln p_{\mathrm{noisy}}$. This is called « noise prediction » training.
If s(x) has instead been trained on a pure denoising objective $\mathbb{E}[\|s(X+\varepsilon)-X\|^2]$, then $x-s(x)$ is a predictor of the noise, hence $\sigma^{-2}(s(x)-x)$ is a proxy for $\nabla\ln p_{\mathrm{noisy}}$. This is called « data prediction » training, or simply denoising training.
Both formulations are found in the literature, as well as affine mixes of the two; the sketch below shows how each parametrization converts back into a score.
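As an illustration, here is a minimal sketch of these two conversions in the abstract setting above, where ε has variance σ²; the function names are mine, not from the original, and the code works equally well on NumPy arrays or PyTorch tensors.

```python
def score_from_noise_prediction(noise_pred, sigma):
    # noise-prediction model: s(x) approximates E[eps | x_noisy = x],
    # so the score proxy is -s(x) / sigma^2
    return -noise_pred / sigma ** 2

def score_from_data_prediction(data_pred, x, sigma):
    # data-prediction model: s(x) approximates E[X | x_noisy = x],
    # so the score proxy is (s(x) - x) / sigma^2
    return (data_pred - x) / sigma ** 2
```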
Let us apply this to our setting. Remember that $p_t$ is the density of $\alpha_t X_0+\varepsilon_t$ where $\varepsilon_t\sim\mathcal N(0,\bar\sigma_t^2 I)$; hence in this case
$$g(x)=(2\pi\bar\sigma_t^2)^{-d/2}e^{-\|x\|^2/2\bar\sigma_t^2}\qquad\text{and}\qquad\nabla\log g(x)=-\frac{x}{\bar\sigma_t^2}.$$
Data prediction model
The « data prediction » (or denoising) parametrization of the neural network would minimize the objective
$$\int_0^T w(t)\,\mathbb{E}\big[\|s_\theta(t,X_t)-\alpha_t X_0\|^2\big]\,dt.$$
Noise prediction model
Alternatively, the « noise prediction » parametrization of the neural network would minimize the objective
$$\int_0^T w(t)\,\mathbb{E}\big[\|s_\theta(t,X_t)-\varepsilon\|^2\big]\,dt$$
where I wrote ε instead of X_1 to emphasize that we are predicting noise.
Since we have access to samples $(x_i,\varepsilon_i,\tau_i)$, where the times $\tau_i$ are drawn according to w, we get the empirical version:
$$\hat\ell(\theta)=\frac{1}{n}\sum_{i=1}^{n}\big\|\varepsilon_i-s_\theta(\tau_i,\,\alpha_{\tau_i}x_i+\bar\sigma_{\tau_i}\varepsilon_i)\big\|^2.$$
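For concreteness, here is a minimal PyTorch sketch of this noise-prediction loss. It assumes the schedules are stored as lookup tables over discretized times and that the network takes (t, x_t) as input; the names model, alpha_bar and sigma_bar are placeholders, and times are drawn uniformly, which corresponds to a constant weighting w.

```python
import torch

def noise_prediction_loss(model, x0, alpha_bar, sigma_bar):
    """One stochastic estimate of the noise-prediction objective:
    draw a time index per sample, form x_t = alpha_t * x0 + sigma_t * eps
    with eps ~ N(0, I), and regress model(t, x_t) onto eps."""
    n = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (n,), device=x0.device)  # uniform times (constant w)
    shape = (n,) + (1,) * (x0.dim() - 1)
    a = alpha_bar[t].reshape(shape)
    s = sigma_bar[t].reshape(shape)
    eps = torch.randn_like(x0)
    x_t = a * x0 + s * eps
    return ((model(t, x_t) - eps) ** 2).flatten(1).sum(dim=1).mean()
```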
Up to the constants and the choice of the drift μt and variance wt, this is exactly the loss function (14) from the DDPM paper. Once this is done, the proxy for ∇lnpt would be
$$\nabla\ln p_t(x)\approx-\frac{s_\theta(t,x)}{\bar\sigma_t}.$$
Plugging this back into the SDE (DDPM) sampling formula, we get
We are now equipped to learn $\nabla\ln p_t$ when $p_t$ is the distribution of something like $\alpha_t X_0+\sigma_t\varepsilon$. The diffusion models we presented earlier provide such distributions, given the drift and diffusion coefficients $\mu_t$ and $\sigma_t$. But in general, we should be able to choose α and σ directly, without relying on the (intricate) mathematics behind the Fokker–Planck equation. This is where the flow matching formulation enters the game.