Flow models II: Score Matching Techniques

March 2025

Let $p$ be any smooth probability density function. Its score is the gradient of its log-density: $\nabla \log p(x)$. This object is of paramount importance in many fields, such as physics, statistics, and machine learning. In particular, it is the key to sampling from $p$ using Langevin dynamics, and we have seen that it is also the key when reversing diffusion processes. In this note, we survey the classical technique for learning the score of a density from its samples: score matching.
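To make the Langevin remark concrete, here is a minimal sketch (my own, not part of the note) of the unadjusted Langevin algorithm $x_{k+1} = x_k + \eta\,\nabla\log p(x_k) + \sqrt{2\eta}\,\xi_k$, run with the exact score of the standard Gaussian; the function name and step size are illustrative choices.

```python
import numpy as np

def langevin_sample(score, x0, step=1e-2, n_steps=2000, rng=None):
    """Unadjusted Langevin dynamics: x <- x + step * score(x) + sqrt(2*step) * noise."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# Score of the standard Gaussian N(0, 1): grad log p(x) = -x.
# Running 2000 independent chains, the samples should be approximately N(0, 1).
samples = langevin_sample(lambda x: -x, x0=np.zeros(2000), step=1e-2, n_steps=2000, rng=0)
```

With a finite step size the chain targets a slightly inflated variance (here $\approx 1/(1-\eta/2)$), which is one reason Metropolis-adjusted variants exist.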

Vanilla Score Matching

The Fisher Divergence

The weighted $L^2$-distance between the scores of two probability densities is often called the Fisher divergence:

$$\mathrm{fisher}(\rho_1 \mid \rho_2) = \int \rho_1(x)|\nabla\log\rho_1(x) - \nabla\log\rho_2(x)|^2 dx.$$

Since our goal is to learn $\nabla\log p(x)$, it is natural to choose a parametrized family of functions $s_\theta$ and to optimize $\theta$ so that the divergence

$$\int p(x)|\nabla\log p(x) - s_\theta(x)|^2 dx$$

is as small as possible. However, this optimization problem is intractable as stated, because the unknown score of $p$ appears explicitly inside the integral. This is where score matching techniques come into play.

The score matching trick

Let $p$ be a smooth probability density function supported on $\mathbb{R}^d$ and let $X$ be a random variable with density $p$. The following elementary identity is due to Hyvärinen (2005); it is the basis for score matching estimation in statistics.

Let $s : \mathbb{R}^d \to \mathbb{R}^d$ be any smooth function with sufficiently fast decay at $\infty$, and let $X \sim p$. Then

$$\mathbb{E}[\vert \nabla \log p(X) - s(X)\vert^2] = c + \mathbb{E}\left[|s(X)|^2 + 2 \nabla \cdot s(X)\right]$$

where $c$ is a constant not depending on $s$.

Proof. We start by expanding the squared norm:

$$\int p(x)|\nabla \log p(x) - s(x)|^2 dx = \int p(x)|\nabla \log p(x)|^2 dx + \int p(x)|s(x)|^2 dx - 2\int \nabla \log p(x)\cdot s(x)\, p(x)\, dx.$$

The first term does not depend on $s$; it is our constant $c$. For the last term, we use $\nabla \log p = \nabla p / p$ and then the integration-by-parts formula:

$$2\int \nabla \log p(x)\cdot s(x)\, p(x)\, dx = 2\int \nabla p(x) \cdot s(x)\, dx = -2 \int p(x)\, \nabla \cdot s(x)\, dx,$$

and the identity is proved.

The optimization problem we want to solve is thus

$$\theta \in \argmin_\theta \mathbb{E}[\vert \nabla \log p(X) - s_{\theta}(X)\vert^2] = \argmin_\theta \mathbb{E}[|s_{\theta}(X)|^2 + 2 \nabla \cdot s_{\theta}(X)].$$

Practically, learning the score of a density $p$ from samples $x^1, \dots, x^n$ is done by minimizing the empirical version of the right-hand side above:

$$\ell(\theta) = \frac{1}{n}\sum_{i=1}^n |s_{\theta}(x^i)|^2 + 2 \nabla \cdot s_{\theta}(x^i).$$
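As a toy illustration (my own sketch, not the note's), take the one-parameter family $s_\theta(x) = -x/\theta$, the score of $\mathscr{N}(0,\theta I_d)$, whose divergence is $-d/\theta$ in closed form. Minimizing the empirical objective then recovers the variance of the data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, true_var = 3, 50_000, 4.0
x = rng.standard_normal((n, d)) * np.sqrt(true_var)   # samples from N(0, 4*I)

def sm_loss(theta):
    # |s_theta(x)|^2 + 2 div s_theta(x)  with  s_theta(x) = -x/theta,
    # whose divergence is -d/theta in closed form.
    return np.mean(np.sum(x**2, axis=1)) / theta**2 - 2 * d / theta

# Minimize over a grid; the minimizer is the empirical second moment per coordinate.
thetas = np.linspace(1.0, 8.0, 701)
theta_hat = thetas[np.argmin([sm_loss(t) for t in thetas])]
```

Here the minimizer of $\ell$ is $\hat\theta = \frac{1}{nd}\sum_i |x^i|^2$, close to the true variance $4$; no access to $\nabla \log p$ was needed.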

How do we empirically optimize this objective in the context of diffusion models?

In the context of diffusion models, we have a whole family of densities $p_t$ and we want to learn the score for every $t \in [0,T]$.

  1. First, we need not solve this optimization problem for every $t$. We could obviously discretize $[0,T]$ with $t_1, \dots, t_N$ and solve for each $\theta_{t_i}$ independently, but it is actually smarter and cheaper to approximate the whole function $(t,x) \mapsto \nabla \log p_t(x)$ by a single neural network (a U-net, in general). That is, we use a parametrized family $s_\theta(t,x)$. This enforces a form of time-continuity, which seems natural. Now, since we want to aggregate the losses over all times, we solve the following problem:

$$\argmin_\theta \int_0^T w(t)\,\mathbb{E}[|s_{\theta}(t, X_t)|^2 + 2 \nabla \cdot s_{\theta}(t, X_t)]\,dt$$

where $w(t)$ is a weighting function (for example, $w(t)$ can be higher for $t\approx 0$, since we do not care about approximating $p_T$ as precisely as $p_0$).

  2. In the preceding formulation we cannot exactly compute the expectation with respect to $p_t$, but we can approximate it with our samples $x_t^i$. Additionally, we need to approximate the integral over time, for instance by discretizing with time steps $t_0=0 < t_1 < \dots < t_N = T$. Our objective function becomes

$$\ell(\theta) =\frac{1}{n}\sum_{t \in \{t_0, \dots, t_N\}} w(t)\sum_{i=1}^n |s_{\theta}(t, x_t^i)|^2 + 2 \nabla\cdot s_{\theta}(t, x_t^i),$$

which looks computable… except it's not ideal. Suppose we perform gradient descent on $\theta$. Then at each descent step, we need to evaluate $s_{\theta}$ as well as its divergence, and then compute the gradient in $\theta$ of the divergence in $x$; in other words, we must compute $\nabla_\theta \nabla_x \cdot s_\theta$. In high dimension, this can be too costly.
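To see where the cost comes from, here is a hypothetical finite-difference divergence (an illustration of mine, not the note's method): each of the $d$ coordinates needs its own pair of evaluations of $s$, mirroring the $d$ backward passes an autodiff implementation of $\nabla_x \cdot s_\theta$ would require.

```python
import numpy as np

def divergence_fd(s, x, eps=1e-5):
    """Divergence of a vector field s: R^d -> R^d at x, by central differences.
    Each coordinate needs its own pair of evaluations of s, so the cost grows
    linearly with the dimension d."""
    d = x.shape[0]
    div = 0.0
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        div += (s(x + e)[i] - s(x - e)[i]) / (2 * eps)
    return div

# Check on s(x) = -x, whose divergence is -d everywhere.
val = divergence_fd(lambda y: -y, np.ones(5))  # close to -5.0
```

In practice this $O(d)$ overhead at every optimization step is what motivates the denoising reformulation below.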

Tweedie's formula for denoising

Denoising Score Matching

Fortunately, there is another way to perform score matching when $p_t$ is the distribution of a random variable with Gaussian noise added, as in our setting. We will present this result in a fairly abstract setting: we suppose that $p$ is a density function and $q = p*g$, where $g$ is another density. The following result is due to Vincent (2010).

Denoising Score Matching Objective

Let $s:\mathbb{R}^d \to \mathbb{R}^d$ be a smooth function. Let $X$ be a random variable with density $p$, and let $\varepsilon$ be an independent random variable with density $g$. We call $p_{\mathrm{noisy}}$ the distribution of $X_{\mathrm{noisy}}=X + \varepsilon$. Then

$$\mathbb{E}[\vert \nabla \log p_{\mathrm{noisy}}(X+\varepsilon) - s(X+\varepsilon)\vert^2] = c + \mathbb{E}[|\nabla \log g(\varepsilon) - s(X+\varepsilon)|^2]$$

where $c$ is a constant not depending on $s$.

Proof. Note that $p_{\mathrm{noisy}} = p * g$. By expanding the square, we see that $\mathbb{E}[\vert \nabla \log p_{\mathrm{noisy}}(X+\varepsilon) - s(X+\varepsilon)\vert^2]$ is equal to

$$c + \int p_{\mathrm{noisy}}(x)|s(x)|^2dx -2\int \nabla p_{\mathrm{noisy}}(x)\cdot s(x)dx.$$

Now by definition, $p_{\mathrm{noisy}}(x) = \int p(y)g(x-y)dy$, hence $\nabla p_{\mathrm{noisy}}(x) = \int p(y)\nabla g(x-y)dy$, and the last term above is equal to

$$\begin{aligned} -2\int \int p(y)\nabla g(x-y)\cdot s(x)dxdy &= -2\int \int p(y)g(x-y)\nabla \log g(x-y)\cdot s(x)dydx\\ &= -2\mathbb{E}[\nabla \log g(\varepsilon)\cdot s(X + \varepsilon)]. \end{aligned}$$

But then, upon adding and subtracting the term $\mathbb{E}[|\nabla \log g(\varepsilon)|^2]$, which does not depend on $s$, we get another constant $c'$ such that

$$\mathbb{E}[\vert \nabla \log p_{\mathrm{noisy}}(X+\varepsilon) - s(X+\varepsilon)\vert^2] = c' + \mathbb{E}[|\nabla \log g(\varepsilon) - s(X + \varepsilon)|^2].$$
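The identity is easy to sanity-check by Monte Carlo (my own sketch): take $p = \mathscr{N}(0,1)$ and $g = \mathscr{N}(0,\sigma^2)$ in dimension one, so that $p_{\mathrm{noisy}} = \mathscr{N}(0, 1+\sigma^2)$ and both scores are known in closed form. Since the two objectives differ by a constant independent of $s$, their gaps between any two candidate functions should coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sig2 = 400_000, 0.5
x = rng.standard_normal(n)                  # X ~ p = N(0, 1)
e = np.sqrt(sig2) * rng.standard_normal(n)  # eps ~ g = N(0, sig2)
z = x + e                                   # X_noisy ~ N(0, 1 + sig2)

score_noisy = lambda t: -t / (1 + sig2)     # exact grad log p_noisy
score_g = lambda t: -t / sig2               # exact grad log g

def lhs(s):  # E | grad log p_noisy(X + eps) - s(X + eps) |^2
    return np.mean((score_noisy(z) - s(z)) ** 2)

def rhs(s):  # E | grad log g(eps) - s(X + eps) |^2
    return np.mean((score_g(e) - s(z)) ** 2)

# Two arbitrary candidate functions: the two objectives should differ
# by the same constant, so the gaps between candidates must agree.
s1 = lambda t: -0.3 * t
s2 = lambda t: np.tanh(t)
gap_lhs = lhs(s1) - lhs(s2)
gap_rhs = rhs(s1) - rhs(s2)
```

The key point is that `rhs` never evaluates the unknown score of $p_{\mathrm{noisy}}$: only the score of the noise density $g$ appears.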

Tweedie's formula

It turns out that the Denoising Score Matching objective is just an avatar of a deep, not-so-well-known result called Tweedie's formula. Herbert Robbins is often credited with the first discovery of this formula in the context of exponential (Poisson) distributions; Maurice Tweedie extended it, and Bradley Efron popularized it in his excellent paper on selection bias.

Tweedie's formula

Let $X$ be a random variable with density $p$, and let $\varepsilon$ be an independent random variable with distribution $\mathscr{N}(0, \sigma^2 I_d)$. We still denote by $p_{\mathrm{noisy}}$ the distribution of $X_{\mathrm{noisy}} = X + \varepsilon$. Then

$$\mathbb{E}[X \mid X_{\mathrm{noisy}}] = X_{\mathrm{noisy}} + \sigma^2 \nabla \ln p_{\mathrm{noisy}}(X_{\mathrm{noisy}}).$$

Equivalently,

$$\mathbb{E}\left[\left. \varepsilon \,\right|\, X_{\mathrm{noisy}}\right] = -\sigma^2 \nabla \ln p_{\mathrm{noisy}}(X_{\mathrm{noisy}}).$$

Proof. The joint density of $(X, X_{\mathrm{noisy}})$ is $p(x)g(z-x)$, hence the conditional expectation of $X$ given $X_{\mathrm{noisy}} = z$ is

$$\frac{\int x p(x)g(z-x)dx}{p_{\mathrm{noisy}}(z)} = z - \frac{\int (z-x)g(z-x)p(x)dx}{p_{\mathrm{noisy}}(z)}.$$

Note that $p_{\mathrm{noisy}}(z) = \int p(x)g(z-x)dx$. Now, since $g$ is the density of $\mathscr{N}(0,\sigma^2 I_d)$, we have $\nabla g(x) = -(x/\sigma^2)\, g(x)$, and the term above is equal to

$$z +\sigma^2 \frac{\int \nabla_z g(z-x)p(x)dx}{p_{\mathrm{noisy}}(z)} = z + \sigma^2 \nabla_z \ln \int g(z-x)p(x)dx,$$

which is the desired result. For the second formula, just note that $X_{\mathrm{noisy}} = \mathbb{E}[X_{\mathrm{noisy}}\mid X_{\mathrm{noisy}}] = \mathbb{E}[X \mid X_{\mathrm{noisy}}] + \mathbb{E}[\varepsilon \mid X_{\mathrm{noisy}}]$ and simplify.

Since $X_{\mathrm{noisy}}$ is centered around $X$, the classical (« frequentist ») estimate for $X$ given $X_{\mathrm{noisy}}$ is $X_{\mathrm{noisy}}$ itself. Tweedie's formula corrects this estimate with a term accounting for the fact that $X$ is itself random: for instance, if $X_{\mathrm{noisy}}$ lands in a region where $X$ is extremely unlikely to live, then the noise is probably responsible for this, and the score term pulls the estimate of $X$ back toward regions of higher probability.
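Tweedie's formula can be checked in closed form in the conjugate Gaussian case (an illustrative choice of mine): if $X \sim \mathscr{N}(m, \tau^2)$ and $\varepsilon \sim \mathscr{N}(0, \sigma^2)$, the posterior mean is known exactly, and it agrees with $z + \sigma^2 \nabla \ln p_{\mathrm{noisy}}(z)$ where $p_{\mathrm{noisy}} = \mathscr{N}(m, \tau^2 + \sigma^2)$.

```python
import numpy as np

m, tau2, sig2 = 1.5, 2.0, 0.5   # X ~ N(m, tau2), eps ~ N(0, sig2)

def posterior_mean(z):
    # Conjugate-Gaussian E[X | X_noisy = z], computed directly.
    return (tau2 * z + sig2 * m) / (tau2 + sig2)

def tweedie(z):
    # X_noisy + sig2 * grad log p_noisy, with p_noisy = N(m, tau2 + sig2),
    # so grad log p_noisy(z) = -(z - m) / (tau2 + sig2).
    return z + sig2 * (-(z - m) / (tau2 + sig2))

zs = np.linspace(-4, 4, 9)
max_gap = np.max(np.abs(posterior_mean(zs) - tweedie(zs)))  # ~0 up to rounding
```

Note how the correction shrinks $z$ toward the prior mean $m$, exactly as the discussion above suggests.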

Now, what is the link between this and Denoising Score Matching? Recall that the conditional expectation of $X$ given $Z$ minimizes the $L^2$-distance, in the sense that $\mathbb{E}[X \mid Z] = f(Z)$ where $f$ minimizes $\mathbb{E}[|f(Z) - X|^2]$. The second formula above says that $\nabla \ln p_{\mathrm{noisy}}$, the quantity we need to estimate, is nothing but the « best denoiser », up to the scaling $-\sigma^2$:

$$\nabla \ln p_{\mathrm{noisy}} \in \argmin_f \mathbb{E}[|f(X + \varepsilon) - (-\varepsilon/\sigma^2)|^2].$$

Replacing $f$ with a neural network, we exactly recover the Denoising Score Matching objective above.

I find that this result gives some intuition about score matching and about how to parametrize the network.

Both parametrizations (predicting the score directly, or predicting the noise) are found in the literature, as well as affine mixes of the two.

Back to diffusions

Let us apply this to our setting. Remember that $p_t$ is the density of $\alpha_t X_0 + \varepsilon_t$ where $\varepsilon_t \sim \mathscr{N}(0,\bar{\sigma}_t^2 I_d)$; hence in this case $g(x) = (2\pi\bar{\sigma}_t^2)^{-d/2}e^{-|x|^2 / 2\bar{\sigma}_t^2}$ and $\nabla \log g(x) = - x / \bar{\sigma}^2_t$.

Data prediction model

The « data prediction » (or denoising) parametrization of the neural network minimizes the objective

$$\int_0^T w(t)\,\mathbb{E}[|s_{\theta}(t, X_t) - \alpha_t X_0|^2]\,dt.$$

Noise prediction model

Alternatively, the « noise prediction » parametrization of the neural network minimizes the objective

$$\int_0^T w(t)\,\mathbb{E}[|s_{\theta}(t, X_t) - \varepsilon|^2]\,dt,$$

where I wrote $\varepsilon$ instead of $X_1$ to emphasize that we are predicting noise.

Since we have access to samples $(x^i, \varepsilon^i, \tau^i)$ where $\tau \sim w$, we get the empirical version:

$$\hat{\ell}(\theta) = \frac{1}{n}\sum_{i=1}^n \left[|\varepsilon^i - s_\theta(\tau^i, \alpha_{\tau^i} x^i + \bar{\sigma}_{\tau^i} \varepsilon^i)|^2\right].$$
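This empirical loss can be sketched in a few lines; the schedule $\alpha_t = e^{-t}$, $\bar\sigma_t = \sqrt{1-e^{-2t}}$, the uniform weighting, and the stand-in network below are illustrative assumptions of mine, not the note's specific choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 2
x0 = rng.standard_normal((n, d))          # "data" samples x^i
t = rng.uniform(0.0, 1.0, size=(n, 1))    # times tau^i drawn from w (uniform here)

# Illustrative noise schedule (an assumption, not the note's choice):
alpha = np.exp(-t)                        # alpha_t
sigma_bar = np.sqrt(1.0 - np.exp(-2*t))   # bar sigma_t

eps = rng.standard_normal((n, d))         # noise samples eps^i ~ N(0, I)
x_t = alpha * x0 + sigma_bar * eps        # noisy samples alpha_t x^i + bar sigma_t eps^i

def s_theta(t, x):
    # Stand-in for the noise-prediction network s_theta(t, x).
    return 0.5 * x

# Empirical noise-prediction loss: mean of |eps^i - s_theta(tau^i, x_t^i)|^2.
loss = np.mean(np.sum((eps - s_theta(t, x_t)) ** 2, axis=1))
```

In an actual training loop, `s_theta` would be a time-conditioned network and `loss` would be backpropagated through; nothing else changes.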

Up to constants and the choice of the drift $\mu_t$ and variance $w_t$, this is exactly the loss function (14) from the DDPM paper. Once training is done, Tweedie's formula provides the proxy for $\nabla \ln p_t$; with the data prediction model $s_\theta(t,x) \approx \mathbb{E}[\alpha_t X_0 \mid X_t = x]$, it reads

$$\nabla \ln p_t(x) \approx \frac{s_{\theta}(t,x) - x}{\bar{\sigma}_t^2}.$$

Plugging this back into the (DDPM) sampling SDE, we get

$$dY_t = \left(\mu_{T-t}Y_t + 2\frac{w^2_{T-t}}{\bar{\sigma}_{T-t}^2}\left(s_{\theta}(T-t,Y_t) - Y_t\right)\right)dt + \sqrt{2w_{T-t}}\,dB_t$$

or

$$dY_t = \left(\mu_{T-t} - 2\frac{w^2_{T-t}}{\bar{\sigma}_{T-t}^2}\right)Y_t\,dt + 2\frac{w^2_{T-t}}{\bar{\sigma}_{T-t}^2}s_{\theta}(T-t,Y_t)\,dt + \sqrt{2w_{T-t}}\,dB_t.$$
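An Euler–Maruyama discretization of such a reverse-time SDE might look as follows. This is a sketch of mine assuming a reverse drift of the form $\mu_{T-t}\,y + 2 w_{T-t}^2 \nabla \ln p_{T-t}(y)$ and diffusion coefficient $\sqrt{2}\,w_{T-t}$ (conventions vary); it is tested on a toy Ornstein–Uhlenbeck case where the score is known exactly.

```python
import numpy as np

def reverse_sde_sample(score, mu, w, T=2.0, n_steps=400, n_samples=4000, rng=None):
    """Euler-Maruyama scheme for
       dY = (mu(T-t) * Y + 2 * w(T-t)**2 * score(T-t, Y)) dt + sqrt(2) * w(T-t) dB,
    started from a standard Gaussian (a proxy for p_T)."""
    rng = np.random.default_rng(rng)
    dt = T / n_steps
    y = rng.standard_normal(n_samples)
    for k in range(n_steps):
        s = T - k * dt                  # current "forward" time
        drift = mu(s) * y + 2.0 * w(s) ** 2 * score(s, y)
        y = y + drift * dt + np.sqrt(2.0 * dt) * w(s) * rng.standard_normal(n_samples)
    return y

# Toy check: for the forward process dX = -X dt + sqrt(2) dB with X_0 ~ N(0, 1),
# p_t = N(0, 1) for every t and grad log p_t(x) = -x, so the sampler
# should return (approximately) standard Gaussian samples.
y = reverse_sde_sample(score=lambda t, x: -x, mu=lambda t: 1.0, w=lambda t: 1.0, rng=0)
```

In a real sampler, `score` would be replaced by the learned proxy built from $s_\theta$, and the time grid would follow the chosen discretization of $[0,T]$.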

Conclusion

We are now equipped to learn $\nabla \ln p_t$ when $p_t$ is the distribution of something like $\alpha_t X_0+\sigma_t \varepsilon$. The diffusion models we presented earlier provide such distributions given the drift and diffusion coefficients $\mu_t$ and $\sigma_t$. But in general, we should be able to directly choose $\alpha$ and $\sigma$ without relying on the (intricate) maths behind the Fokker-Planck equation. This is where the flow matching formulation enters the game.

References

Hyvärinen's paper on Score Matching.

Efron's paper on Tweedie's formula - a gem in statistics.