Flow models III: Flow Matching

March 2025

In the preceding notes, we've seen how diffusion models are trained and sampled from, and how the score function $\nabla \ln p_t$ is efficiently learnt using score matching. However, since their inception, diffusion models have felt a little bit weird for various reasons.

For these reasons, the community has been looking for a new model that would be more intuitive, more flexible, and more powerful, able to bridge any two distributions in finite time, with deterministic (ODE) sampling. In the end, it turns out that Flow Matching is almost entirely equivalent to diffusion score matching, except that the presentation and the way we do things are slightly different – but way more flexible.

Flow matching

Let $(X_0, X_1)$ be a couple of random variables sampled from $p_0$ and $p_1$. Actually, they can even be dependent: we only impose that their marginal distributions are $p_0$ and $p_1$, so we denote by $\pi$ their joint distribution and we suppose that $\int \pi(x,y)dx = p_1(y)$ and $\int \pi(x,y)dy = p_0(x)$. From a probabilistic perspective, $(X_0, X_1)$ can be any coupling between $p_0$ and $p_1$.

Conditional and annealed flows

Suppose that there is a smooth function $\varphi : (t, x, y) \to \varphi_t(x,y)$ such that $\varphi_0(x,y) = x$ and $\varphi_1(x,y) = y$. This provides a connection between $p_0$ and $p_1$ by defining random variables

$$X_t = \varphi_t(X_0, X_1).$$

This connection is called the conditional flow of the system. We denote by $p_t$ the density of $X_t$.

The flow comes from an ODE. The probability path $p_t$ satisfies the continuity equation $\partial_t p_t = -\nabla \cdot (v_t p_t)$, where $v_t(x) = \mathbb{E}[\dot{X}_t \mid X_t = x]$. In other words, $X_t$ has the same marginals as the ODE system
$$\dot{x}_t = v_t(x_t), \qquad x_0 \sim p_0.$$

We emphasize the fact that $X_t$ does not satisfy this ODE in general. The main point is that $X_t$ (the conditional flow) and $x_t$ (the unconditional, or annealed, flow defined by the ODE) have the same marginals $p_t$. This is more or less what happened for diffusions, where the SDE and ODE paths had the same marginals but not the same distribution as processes.

Proof. We follow the proofs in the Stochastic Interpolant paper. The Fourier transform of $p_t$ is $\hat{p}_t(\xi) = \mathbb{E}[e^{i \langle \xi, X_t\rangle}]$. Differentiating in $t$ yields $\partial_t \hat{p}_t(\xi) = \widehat{\partial_t p_t}(\xi)$ since time-differentiation and the Fourier transform commute. On the other hand, by passing $\partial_t$ inside the expectation and conditioning on $X_t$, we get
$$\begin{aligned} \widehat{\partial_t p_t}(\xi) &= \mathbb{E}\big[i\langle \xi, \dot{X}_t\rangle e^{i\langle\xi, X_t\rangle}\big]\\ &= \mathbb{E}\big[i\langle \xi, \mathbb{E}[\dot{X}_t\mid X_t]\rangle e^{i\langle\xi, X_t\rangle}\big]\\ &= \mathbb{E}\big[i\langle \xi, v_t(X_t)\rangle e^{i\langle\xi, X_t\rangle}\big] \\ &= \int p_t(x)\,\langle v_t(x), \nabla_x e^{i\langle \xi, x \rangle}\rangle\, dx. \end{aligned}$$

The last equality holds since $\nabla_x e^{i\langle \xi, x\rangle} = i\xi e^{i\langle \xi, x\rangle}$. Integrating by parts, the last integral is equal to

$$-\int \nabla_x \cdot [v_t(x)p_t(x)]\, e^{i\langle \xi, x\rangle}dx = \widehat{-\nabla \cdot (v_t p_t)}(\xi).$$

Since the Fourier transform is injective, we get $\partial_t p_t = -\nabla \cdot (v_t p_t)$.

The joint distribution of $(X_0, X_1, X_t)$ has density $\pi(x_0, x_1)p_t(x \mid x_0, x_1)$, where $p_t(\cdot \mid x_0, x_1)$ denotes the conditional density of $X_t$ given $(X_0, X_1)$. Consequently, the conditional distribution of $(X_0, X_1)$ given $X_t = x$ is $\pi(x_0, x_1)p_t(x \mid x_0, x_1) / p_t(x)$, where $p_t(x)$ is the marginal density of $X_t$. Formally, we can thus write the velocity field as the following integral:

$$v_t(x) = \int_{\mathbb{R}^d \times \mathbb{R}^d} \dot{\varphi}_t(x_0, x_1)\,\frac{p_t(x \mid x_0, x_1)\,\pi(x_0, x_1)}{p_t(x)}\,dx_0\,dx_1.$$

This is the formula appearing in Lipman's paper.

Sampling the probability path.

Sampling $X_t$ is easy when we have at our disposal samples $X_0, X_1$ from $p_0, p_1$. But when we only have one of them, say $X_0$, we cannot use this formula: we have to solve the ODE $\dot{x}_t = v_t(x_t)$ started at $X_0$, and this requires knowledge of $v_t$. That is not directly doable, since its expression needs knowledge of $p_t$. However, the $L^2$ loss $\mathbb{E}[|s(X_t) - v_t(X_t)|^2]$ can be efficiently minimized without knowing $v_t$.

Learning the velocity

Flow Matching Loss.

Let $s$ be any function. Then,

$$\mathbb{E}[|s(X_t) - v_t(X_t)|^2] = \mathbb{E}[|s(X_t) - \dot{X}_t|^2] + c,$$

where $c = \mathbb{E}[|\dot{X}_t|^2] - \mathbb{E}[|v_t(X_t)|^2]$ is a constant with respect to $s$.

The practical consequence is that if $s^\theta$ is smoothly parametrized by $\theta$, then the $L^2$-loss

$$L_\star(\theta) = \mathbb{E}[|s^\theta(X_t) - v_t(X_t)|^2]$$

and the Flow-Matching loss

$$L(\theta) = \mathbb{E}[|s^\theta(X_t) - \dot{X}_t|^2]$$

have the same gradients and the same minimizers.

Proof. Develop the square:
$$\mathbb{E}[|s(X_t) - v_t(X_t)|^2] = \mathbb{E}[|s(X_t)|^2] + \mathbb{E}[|v_t(X_t)|^2] - 2\mathbb{E}[\langle s(X_t), v_t(X_t)\rangle].$$
The last term is equal to

$$\mathbb{E}[\langle s(X_t), \mathbb{E}[\dot{X}_t \mid X_t]\rangle].$$

Since $s(X_t)$ is a function of $X_t$, it can be moved inside the conditional expectation, so this is $\mathbb{E}[\mathbb{E}[\langle s(X_t), \dot{X}_t \rangle \mid X_t]]$; deconditioning then gives $\mathbb{E}[\langle s(X_t), \dot{X}_t\rangle]$. Going back to the first line, adding and subtracting $\mathbb{E}[|\dot{X}_t|^2]$, we get
$$\begin{aligned}\mathbb{E}[|s(X_t) - v_t(X_t)|^2] &= \mathbb{E}[|s(X_t)|^2] + \mathbb{E}[|v_t(X_t)|^2] - 2\mathbb{E}[\langle s(X_t), \dot{X}_t\rangle] + \mathbb{E}[|\dot{X}_t|^2] - \mathbb{E}[|\dot{X}_t|^2] \\ &= \mathbb{E}[|s(X_t) - \dot{X}_t|^2] + \mathbb{E}[|\dot{X}_t|^2] - \mathbb{E}[|v_t(X_t)|^2]. \end{aligned}$$
The last two terms are constants with respect to $s$.

Everything is now tractable. In practice, to learn $v_t$ we use a parametrized family of smooth functions $s^\theta_t$ and we minimize $L$ for "any" time $t$: we sample $t$ at random (say, uniformly in $[0,1]$) together with a pair $(X_0, X_1)$, form $X_t$ and $\dot{X}_t$, and take a stochastic gradient step on the Flow-Matching loss.
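To make this concrete, here is a minimal sketch of one such training step in PyTorch. All names are illustrative: v_theta is the velocity network, and cond_flow / cond_velocity are callables implementing a chosen conditional flow $\varphi_t$ and its time derivative $\dot{\varphi}_t$ (a concrete choice is given in the next section).

```python
import torch

def fm_training_step(v_theta, optimizer, x0, x1, cond_flow, cond_velocity):
    """One stochastic gradient step on the Flow-Matching loss
    E |v_theta(X_t, t) - dX_t/dt|^2, with t drawn uniformly in [0, 1]."""
    # one time per sample, broadcastable against the data shape
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))
    xt = cond_flow(t, x0, x1)          # X_t = phi_t(X_0, X_1)
    target = cond_velocity(t, x0, x1)  # dX_t/dt, the regression target
    loss = ((v_theta(xt, t) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```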

Design of conditional flows

Now that everything is set, we have to design an efficient conditional flow $\varphi_t$.

Linear flows

The simplest (and, indeed, very powerful) flow is the linear one, $\varphi_t(x,y) = \alpha_t x + \sigma_t y$, giving

$$X_t = \alpha_t X_0 + \sigma_t X_1,$$

where $\alpha, \sigma$ are differentiable and satisfy $\alpha_0 = \sigma_1 = 1$ and $\alpha_1 = \sigma_0 = 0$. The trajectories go from $X_0$ to $X_1$ at a velocity given by

$$\dot{X}_t = \dot{\alpha}_t X_0 + \dot{\sigma}_t X_1.$$

In the simplest setting $\alpha_t = 1-t$ and $\sigma_t = t$, the trajectories are straight lines and the velocity is constant, $\dot{X}_t = X_1 - X_0$, so that the flow-matching loss minimizes $\mathbb{E}[|s(X_t) - (X_1 - X_0)|^2]$.
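In code, and with the same illustrative conventions as the training sketch above, the two conditional-flow helpers for this simplest linear choice are just a few lines:

```python
def linear_cond_flow(t, x0, x1):
    # X_t = (1 - t) X_0 + t X_1
    return (1 - t) * x0 + t * x1

def linear_cond_velocity(t, x0, x1):
    # dX_t/dt = X_1 - X_0, constant along each trajectory
    return x1 - x0
```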

"Gaussian" flows

In practice, the goal of (most) generative models is to sample from $p_0$, which leaves open the choice of $p_1$. The natural choice goes for simplicity, with $p_1 = N(0, I_d)$. In this case, writing $\varepsilon$ instead of $X_1$, the marginals of $X_t = \alpha_t X_0 + \sigma_t \varepsilon$ are exactly the ones we found for the noising process in diffusions. The velocity $v_t(x) = \mathbb{E}[\dot{X}_t \mid X_t=x]$ is simply

$$v_t(x) = \dot{\alpha}_t\,\mathbb{E}[X_0 \mid X_t=x] + \dot{\sigma}_t\,\mathbb{E}[\varepsilon \mid X_t = x].$$

Tweedie's formula, as seen in the preceding note on score matching, gives

$$v_t(x) = \frac{\dot{\alpha}_t}{\alpha_t}x + \sigma_t^2 \nabla \ln p_t(x)\left[ \frac{\dot{\alpha}_t}{\alpha_t} - \frac{\dot{\sigma}_t}{\sigma_t} \right].$$

Note that $\dot{\alpha}_t/\alpha_t - \dot{\sigma}_t/\sigma_t$ is equal to $\dot{\lambda}_t/2$, which is half the time-derivative of the log-SNR (signal-to-noise ratio) $\lambda_t = \ln(\alpha_t^2/\sigma_t^2)$.
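As an illustration, converting a score model into this velocity field could look like the following sketch; score_fn, alpha, sigma and their derivatives dalpha, dsigma are assumed to be given, and all names are hypothetical.

```python
def velocity_from_score(x, t, score_fn, alpha, sigma, dalpha, dsigma):
    """Velocity field obtained from a score estimate of grad log p_t:
    v_t(x) = (da/a) x + sigma^2 * score(x, t) * (da/a - ds/s)."""
    a, s = alpha(t), sigma(t)
    da, ds = dalpha(t), dsigma(t)
    return (da / a) * x + s ** 2 * score_fn(x, t) * (da / a - ds / s)
```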

Proof. Tweedie's denoising formula says that since $p_t$ is the distribution of $\alpha_t X_0$ noised by $\sigma_t \varepsilon \sim N(0, \sigma_t^2 I_d)$, then

$$\mathbb{E}[\varepsilon \mid X_t] = \frac{\mathbb{E}[\sigma_t \varepsilon \mid X_t]}{\sigma_t} = -\sigma_t \nabla \ln p_t(X_t).$$

Similarly,

$$\mathbb{E}[X_0 \mid X_t] = \frac{X_t + \sigma_t^2\nabla \ln p_t(X_t)}{\alpha_t}.$$

Gathering the two yields the formula.
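As a quick sanity check (not part of the original derivation), here is a toy one-dimensional Gaussian case where both the score and the velocity are available in closed form, so the velocity formula can be verified numerically; all names are illustrative.

```python
# Toy check: X_0 ~ N(0, 1), eps ~ N(0, 1), X_t = alpha_t X_0 + sigma_t eps.
# Then X_t ~ N(0, alpha_t^2 + sigma_t^2), so grad log p_t(x) = -x / (alpha_t^2 + sigma_t^2),
# and E[dX_t/dt | X_t = x] = (dalpha_t alpha_t + dsigma_t sigma_t) / (alpha_t^2 + sigma_t^2) * x.
alpha, dalpha = (lambda t: 1 - t), (lambda t: -1.0)
sigma, dsigma = (lambda t: t), (lambda t: 1.0)
score = lambda x, t: -x / (alpha(t) ** 2 + sigma(t) ** 2)

t, x = 0.3, 1.7
v_formula = (dalpha(t) / alpha(t)) * x \
    + sigma(t) ** 2 * score(x, t) * (dalpha(t) / alpha(t) - dsigma(t) / sigma(t))
v_exact = (dalpha(t) * alpha(t) + dsigma(t) * sigma(t)) / (alpha(t) ** 2 + sigma(t) ** 2) * x
assert abs(v_formula - v_exact) < 1e-10  # the two expressions agree
```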

This formula tells us (once again!) that, given the choice of $\alpha_t, \sigma_t$, the only thing that matters is the knowledge of the score $\nabla \ln p_t$. How this score was learnt is actually a secondary problem: we don't care whether the learning was done in a diffusion framework or otherwise. Once we have it, we can plug it into the velocity formula and sample.

By carefully looking at the derivations, we can see that there are one-to-one linear connections between the score $\nabla \ln p_t$, the noise predictor $\mathbb{E}[\varepsilon \mid X_t]$, the data predictor $\mathbb{E}[X_0 \mid X_t]$ and the velocity $v_t$.

Only one of them needs to be learned, and we can convert it into the three others. If one wants to use a pretrained model for sampling, one thus needs to keep track of how the model was learned: is it a score, a denoiser, a data predictor or a velocity?
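For reference, the linear relations derived above read, side by side (with the Gaussian flow $X_t = \alpha_t X_0 + \sigma_t \varepsilon$ and conditioning on $X_t = x$):
$$\begin{aligned}
\text{noise predictor:}\quad & \mathbb{E}[\varepsilon \mid X_t = x] = -\sigma_t \nabla \ln p_t(x),\\
\text{data predictor:}\quad & \mathbb{E}[X_0 \mid X_t = x] = \frac{x + \sigma_t^2 \nabla \ln p_t(x)}{\alpha_t},\\
\text{velocity:}\quad & v_t(x) = \dot{\alpha}_t\, \mathbb{E}[X_0 \mid X_t = x] + \dot{\sigma}_t\, \mathbb{E}[\varepsilon \mid X_t = x].
\end{aligned}$$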

We close this part with an important note: given a score model for $\nabla \ln p_t$, regardless of how it is formulated and trained, using it to sample from the flow $\dot{x}_t = v_t(x_t)$ or to sample from the DDIM ODE in diffusion models is exactly the same.

Optimal transport flows

There is room for the choice of the connection $\varphi_t$; among these choices, some should be better than others. Each such connection provides a velocity field $v_t$ transporting samples from $p_0$ to samples from $p_1$. Consider all the possible fields $u_t$ such that the solutions of the ODE $\dot{x}_t = u_t(x_t)$ started at $x_0 \sim p_0$ satisfy $x_1 \sim p_1$. The best possible one should minimize the total kinetic energy,

$$\mathbb{E}\int_0^1 |u_t(x_t)|^2dt,$$

where the expectation is over $x_0 \sim p_0$. This problem is widely studied and can be solved: it is quite intuitive that the optimal trajectories are straight lines, so the optimal velocity is constant along each of them, and indeed the optimal flow is given by $x_t = x + t(\phi(x) - x)$, where $\phi$ is the Brenier map, the unique (under some conditions) map $\phi : \mathbb{R}^d \to \mathbb{R}^d$ such that $\phi(X_0)$ has distribution $p_1$ and which minimizes the quadratic transport cost $\mathbb{E}[|X_0 - \phi(X_0)|^2]$. Of course, computing $\phi$ is in general intractable.

This solves the unconditional OT problem, but what we actually design are conditional flows. One nice trick goes as follows: using Jensen's inequality and deconditioning, we can write

$$\begin{aligned}\mathbb{E}\int_0^1 |v_t(X_t)|^2 dt &= \mathbb{E}\int_0^1 |\mathbb{E}[\dot{X}_t \mid X_t]|^2 dt \\ &\leqslant \mathbb{E}\int_0^1 \mathbb{E}[|\dot{X}_t|^2 \mid X_t] dt \\ &= \int_0^1 \mathbb{E}[|\dot{X}_t|^2] dt \\ &= \mathbb{E}\int_0^1 |\dot{X}_t|^2 dt. \end{aligned}$$

This last bound is the expected kinetic energy of the conditional flow over the boundary distributions $X_0 \sim p_0$, $X_1 \sim p_1$. For each realization of $(X_0, X_1)$, we can try to find the path $\gamma_t$ which minimizes $\int_0^1 |\dot{\gamma}_t|^2 dt$ subject to the boundary conditions $\gamma_0 = X_0$ and $\gamma_1 = X_1$. The solution is obviously the straight line $\gamma_t = X_0 + t(X_1 - X_0)$ (this can be found formally using the Euler–Lagrange conditions, as shown below).
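For completeness, here is the Euler–Lagrange computation: the Lagrangian $|\dot{\gamma}_t|^2$ does not depend on $\gamma_t$ itself, so
$$\frac{d}{dt}\left(\frac{\partial}{\partial \dot{\gamma}} |\dot{\gamma}_t|^2\right) = 2\ddot{\gamma}_t = 0,$$
and the only path with zero acceleration joining the two endpoints is $\gamma_t = X_0 + t(X_1 - X_0)$.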

The conclusion of these considerations is as follows.

The conditional linear flow $X_t = (1-t)X_0 + tX_1$ is a minimizer of a bound on the kinetic energy among all the flows transporting $p_0$ to $p_1$.

It might not be a minimizer of the kinetic energy itself, but it is a good starting point. In any case, it is important to keep in mind that flowing straight between two points is absolutely not the same as flowing straight between two distributions.

Conclusion

Up to now, we've seen how flow matching reformulates and simplifies diffusion sampling. There remains a problem: in practice, to sample from these models, we need to solve an ODE or an SDE, which is done by discretizing time and using a scheme like Euler or Runge–Kutta. But every time step requires a feedforward evaluation of a neural network. Most schemes need around 50 time steps, which is why sampling from diffusions can be slow.
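As a sketch, the simplest Euler discretization of the flow ODE looks as follows, integrating from the noise side $t=1$ down to the data side $t=0$; the trained velocity network v_theta is a placeholder name, assumed to take a batch of points and a batch of times.

```python
import torch

@torch.no_grad()
def euler_sampler(v_theta, x1, n_steps=50):
    """Integrate dx/dt = v_theta(x, t) backwards from t = 1 (noise) to t = 0 (data)
    with an Euler scheme; each step costs one network evaluation."""
    x = x1
    ts = torch.linspace(1.0, 0.0, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t                                   # negative: backwards in time
        t_batch = torch.full((x.shape[0], 1), float(t))   # broadcast t over the batch
        x = x + dt * v_theta(x, t_batch)                  # Euler update
    return x
```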

Consistency models try to directly learn the flow $\varphi_t$ and not the velocity $v_t$ – and yes, the naming should have been velocity matching from the beginning.

References

The three seminal papers that (independently) found the Flow Matching formulation of generative diffusion models are Lipman et al.'s Flow Matching paper, Albergo and Vanden-Eijnden's Stochastic Interpolants paper, and Liu et al.'s Rectified Flow paper.

I find the stochastic interpolant formulation way cleaner and nicer than Lipman's one.

Since then, there have been many surveys on the topic. I quite like Meta's excellent survey of Flow Matching by Lipman and coauthors, for its depth and variety. There is also this nice blog post on the topic by Mathurin Massias and others, and this very recent one by people at DeepMind, clarifying the link between diffusions and FM.

These excellent slides by Brandon Amos present a history of the topic and how we evolved from diffusions to flows through neural ODEs and normalizing flows.

Training FM "at scale", by the Stability team, who to my knowledge were the first to scale FM training. They later produced the FLUX family of models.

Brenier's theorem, a little bit mathy.