In the preceding notes, we've seen how diffusion models are trained and sampled from, and how the score function is efficiently learnt using score matching. However, since their inception, diffusion models have felt a little bit weird, for various reasons.
First, they do not « really » bridge $\pi_{\mathrm{data}}$ with $\mathcal{N}(0, I)$. They bridge $\pi_{\mathrm{data}}$ with $p_T$, which is only approximately $\mathcal{N}(0, I)$. This is absolutely not important practically, but from a theoretical point of view, it is a bit unsatisfactory. There should be a way to bridge $\pi_{\mathrm{data}}$ with $\mathcal{N}(0, I)$ exactly in finite time.
Second, the design of a diffusion feels a little bit clunky. How do we choose the drift and diffusion coefficients? In the end, it looks like the coefficients $\alpha_t, \beta_t$ such that $x_t$ has the same law as $\alpha_t x_0 + \beta_t \varepsilon$ are the ones that matter, so why not choose them directly?
Finally, this ODE/SDE duality is a bit confusing. In the end, the SDE formulation is not really useful since we are only interested in the marginals $p_t$, and the ODE sampling feels much simpler. There was a time when it was not clear why SDEs seemed to work better (while there was absolutely no theoretical reason for that).
For these reasons, the community has been looking for a new model that would be more intuitive, more flexible, and more powerful, able to bridge any two distributions in finite time, with deterministic (ODE) sampling. In the end, it turns out that Flow-Matching is almost entirely equivalent to diffusion score matching, except that the presentation and the way we're doing things is slightly different – but way more flexible.
Let $(x_0, x_1)$ be a couple of random variables sampled from $\pi_0$ and $\pi_1$. Actually, they can even be dependent: we only impose that their marginal distributions are $\pi_0$ and $\pi_1$, so we note $\pi$ their joint distribution and we suppose that $x_0 \sim \pi_0$ and $x_1 \sim \pi_1$. From a probabilistic perspective, $\pi$ can be any coupling between $\pi_0$ and $\pi_1$.
Conditional and annealed flows
Suppose that there is a smooth function $(t, x_0, x_1) \mapsto \varphi_t(x_0, x_1)$ such that $\varphi_0(x_0, x_1) = x_0$ and $\varphi_1(x_0, x_1) = x_1$. This provides a connection between $\pi_0$ and $\pi_1$ by defining the random variables
$$x_t = \varphi_t(x_0, x_1), \qquad t \in [0, 1].$$
This connection is called the conditional flow of the system. We note $p_t$ the density of $x_t$.
We emphasize the fact that $t \mapsto x_t$ does not satisfy an ODE $\dot{x}_t = v_t(x_t)$ in general. The main point is that $x_t$ (the conditional flow) and $y_t$ (the unconditional, or annealed, flow, defined by the ODE $\dot{y}_t = v_t(y_t)$ started at $y_0 \sim \pi_0$, where $v_t(x) = \mathbb{E}[\partial_t\varphi_t(x_0, x_1) \mid x_t = x]$) have the same marginals $p_t$. This is more or less what happened for diffusions, where the SDE and ODE paths had the same marginals but not the same path distribution.
Proof. We follow the proofs in the Stochastic Interpolant paper. The Fourier transform of $p_t$ is $\hat{p}_t(\xi) = \mathbb{E}[e^{i\langle \xi, x_t\rangle}]$. Differentiating in $t$ yields
$$\partial_t \hat{p}_t(\xi) = \mathbb{E}\big[i\langle \xi, \partial_t\varphi_t(x_0, x_1)\rangle\, e^{i\langle \xi, x_t\rangle}\big],$$
since time-differentiation and Fourier transform commute. On the other hand, by passing the derivative inside the expectation and conditioning on $x_t$, we get
$$\partial_t \hat{p}_t(\xi) = \mathbb{E}\big[i\langle \xi, v_t(x_t)\rangle\, e^{i\langle \xi, x_t\rangle}\big] = \int i\langle \xi, v_t(x)\rangle\, e^{i\langle \xi, x\rangle}\, p_t(x)\,dx.$$
Also, since $\widehat{\nabla\cdot f}(\xi) = -i\langle \xi, \hat{f}(\xi)\rangle$, the last integral is equal to
$$-\widehat{\nabla\cdot(p_t v_t)}(\xi).$$
Since the Fourier transform is injective, we get the continuity equation $\partial_t p_t + \nabla\cdot(p_t v_t) = 0$, which is exactly the equation satisfied by the marginal densities of the ODE $\dot{y}_t = v_t(y_t)$.
The joint distribution of $z = (x_0, x_1)$ and $x_t$ is given by $p_t(x \mid z)\,\pi(dz)$, where $p_t(\cdot \mid z)$ is the conditional density of $x_t$ given $z$. Consequently, the conditional distribution of $z$ given $x_t = x$ is $p_t(x \mid z)\,\pi(dz)/p_t(x)$, where $p_t(x) = \int p_t(x \mid z)\,\pi(dz)$ is the marginal density of $x_t$. Formally, we can thus write the velocity field as the following integral:
$$v_t(x) = \int \partial_t\varphi_t(z)\,\frac{p_t(x \mid z)}{p_t(x)}\,\pi(dz).$$
This is the formula appearing in Lipman's paper.
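To make this conditional-expectation formula concrete, here is a small numerical sanity check in one dimension, where everything is Gaussian and the velocity has a closed form. The setup ($\pi_0 = \mathcal{N}(0, s^2)$, $\pi_1 = \mathcal{N}(0, 1)$, linear path) and all variable names are illustrative choices for this sketch, not anything from the papers.

```python
import numpy as np

# Sanity check of v_t(x) = E[d/dt phi_t | x_t = x] in 1D, with
# pi_0 = N(0, s^2), pi_1 = N(0, 1) (written eps), independent coupling,
# linear path x_t = (1-t) x_0 + t eps.

rng = np.random.default_rng(0)
s = 2.0            # std of pi_0
t, x = 0.5, 1.0    # where we evaluate the velocity
a, b = 1.0 - t, t          # alpha_t, beta_t
da, db = -1.0, 1.0         # their time derivatives

# Monte-Carlo version of the integral: average the conditional velocity
# da*x0 + db*eps, weighted by the density of x_t | x_0, which is
# N(x; a*x0, b^2), then normalize (self-normalized importance sampling).
x0 = rng.normal(0.0, s, size=500_000)
w = np.exp(-0.5 * ((x - a * x0) / b) ** 2)   # unnormalized weights
eps_given = (x - a * x0) / b                  # eps forced by x_t = x
v_mc = np.sum(w * (da * x0 + db * eps_given)) / np.sum(w)

# Closed form: for Gaussian marginals p_t = N(0, sigma_t^2) with
# sigma_t^2 = a^2 s^2 + b^2, one gets v_t(x) = (sigma_t'/sigma_t) x.
var = a**2 * s**2 + b**2
v_exact = (a * da * s**2 + b * db) / var * x

print(v_mc, v_exact)   # both close to -1.2
```

The weighted average is exactly the integral above, with the intractable $p_t(x)$ appearing as the normalization of the weights.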
Sampling the probability path.
Sampling $x_t \sim p_t$ is easy when we have at our disposal samples from the coupling $\pi$. But when we have only one of the endpoints, say $x_1$, we cannot use this formula, so we have to sample the ODE $\dot{y}_t = v_t(y_t)$ started at $x_1$: but this would need knowledge of $v_t$. That is not directly doable, since its expression needs knowledge of $p_t$. However, the Flow-Matching loss below can efficiently be minimized without knowing $v_t$.
Learning the velocity
Flow Matching Loss.
Let $f$ be any function. Then,
$$\mathbb{E}\big[|f(x_t, t) - \partial_t\varphi_t(x_0, x_1)|^2\big] = \mathbb{E}\big[|f(x_t, t) - v_t(x_t)|^2\big] + c,$$
where $c$ is a constant with respect to $f$.
The practical consequence is that if $v_\theta$ is smoothly parametrized by $\theta$, then the $v$-loss
$$\ell(\theta) = \mathbb{E}\big[|v_\theta(x_t, t) - v_t(x_t)|^2\big]$$
and the Flow-Matching loss
$$\mathcal{L}(\theta) = \mathbb{E}\big[|v_\theta(x_t, t) - \partial_t\varphi_t(x_0, x_1)|^2\big]$$
have the same gradients and the same minimizers.
Proof. Develop the square:
$$\mathbb{E}\big[|f(x_t, t) - \partial_t\varphi_t|^2\big] = \mathbb{E}\big[|f(x_t, t)|^2\big] + \mathbb{E}\big[|\partial_t\varphi_t|^2\big] - 2\,\mathbb{E}\big[\langle f(x_t, t), \partial_t\varphi_t\rangle\big].$$
The last term is equal to
$$\mathbb{E}\big[\mathbb{E}[\langle f(x_t, t), \partial_t\varphi_t\rangle \mid x_t]\big].$$
Since averages commute with any linear operator, we can write this as $\mathbb{E}\big[\langle f(x_t, t), \mathbb{E}[\partial_t\varphi_t \mid x_t]\rangle\big]$, then we can decondition and get $\mathbb{E}\big[\langle f(x_t, t), v_t(x_t)\rangle\big]$. Going back to the first line, adding and subtracting $\mathbb{E}[|v_t(x_t)|^2]$, we get
$$\mathbb{E}\big[|f(x_t, t) - \partial_t\varphi_t|^2\big] = \mathbb{E}\big[|f(x_t, t) - v_t(x_t)|^2\big] + \mathbb{E}\big[|\partial_t\varphi_t|^2\big] - \mathbb{E}\big[|v_t(x_t)|^2\big].$$
The last two terms are constants with respect to $f$.
Everything is now tractable. In practice, to learn $v$ we use a parametrized family of smooth functions $v_\theta$ and we minimize the Flow-Matching loss for « any » time $t$: we
sample batches $(x_0, x_1)$ from the coupling $\pi$;
sample random times $t$ in $[0, 1]$ using a distribution which can be uniform or not;
compute the conditional flows $x_t = \varphi_t(x_0, x_1)$ and the conditional velocities $\partial_t\varphi_t(x_0, x_1)$ for all the samples of the batch;
compute the discrepancy $|v_\theta(x_t, t) - \partial_t\varphi_t(x_0, x_1)|^2$ for all samples of the batch;
backpropagate the gradient of this discrepancy to update $\theta$.
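The recipe above can be sketched in a few lines. A real implementation would use a neural network and an autograd framework; to keep this sketch self-contained, the « model » below is linear in hand-picked features and its gradient is written by hand, which is enough to watch the Flow-Matching loss decrease.

```python
import numpy as np

# Minimal flow-matching training loop in 1D, with alpha_t = 1-t, beta_t = t.
# Everything here (distributions, features, step size) is illustrative.

rng = np.random.default_rng(0)
n = 4096
x0 = rng.normal(0.0, 2.0, n)   # data batch, pi_0 = N(0, 4)
x1 = rng.normal(0.0, 1.0, n)   # noise batch, pi_1 = N(0, 1), independent coupling
t = rng.uniform(0.0, 1.0, n)   # random times, uniform here

xt = (1.0 - t) * x0 + t * x1   # conditional flows  phi_t(x0, x1)
target = x1 - x0               # conditional velocities (constant for this path)

feats = np.stack([xt, t, xt * t, np.ones(n)], axis=1)   # (n, 4) "network" inputs
theta = np.zeros(4)

def loss(theta):
    return np.mean((feats @ theta - target) ** 2)

loss_init = loss(theta)
for _ in range(500):                   # plain gradient descent on the FM loss
    resid = feats @ theta - target
    grad = 2.0 * feats.T @ resid / n   # exact gradient of the squared loss
    theta -= 0.05 * grad
loss_final = loss(theta)

print(loss_init, loss_final)   # the loss decreases
```

Note that the loss never reaches zero: the conditional target $x_1 - x_0$ is random given $x_t$, and the minimizer is its conditional expectation $v_t(x_t)$, exactly as the theorem above says.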
Now that everything is set, we have to design an efficient conditional flow $\varphi_t$.
The simplest (and, indeed, very powerful) flow is the linear one,
$$\varphi_t(x_0, x_1) = \alpha_t x_0 + \beta_t x_1,$$
where $\alpha, \beta$ are differentiable and satisfy $\alpha_0 = \beta_1 = 1$ and $\alpha_1 = \beta_0 = 0$. The trajectories go from $x_0$ to $x_1$ at a velocity given by
$$\partial_t\varphi_t(x_0, x_1) = \dot\alpha_t x_0 + \dot\beta_t x_1.$$
In the simplest setting $\alpha_t = 1 - t$ and $\beta_t = t$, the trajectories are straight lines and the velocity is constant, $\partial_t\varphi_t(x_0, x_1) = x_1 - x_0$, so that the flow-matching loss minimizes $\mathbb{E}\big[|v_\theta(x_t, t) - (x_1 - x_0)|^2\big]$.
In practice, the goal of (most) generative models is to sample from $\pi_0$ (the data distribution), which leaves open the choice for $\pi_1$. The natural choice goes for simplicity: $\pi_1 = \mathcal{N}(0, I)$, independent of $x_0$. In this case, noting $\varepsilon$ instead of $x_1$, the marginals of $x_t = \alpha_t x_0 + \beta_t \varepsilon$ are exactly the ones we found for the noising process in diffusions. The conditional velocity is simply
$$\partial_t\varphi_t(x_0, \varepsilon) = \dot\alpha_t x_0 + \dot\beta_t \varepsilon.$$
Tweedie's formula, as seen in the preceding note on score matching, gives
$$v_t(x) = \frac{\dot\alpha_t}{\alpha_t}\,x + \Big(\frac{\dot\alpha_t}{\alpha_t}\beta_t^2 - \dot\beta_t\beta_t\Big)\nabla\log p_t(x).$$
Note that $\frac{\dot\alpha_t}{\alpha_t}\beta_t^2 - \dot\beta_t\beta_t$ is equal to $\frac{\beta_t^2}{2}\,\dot\lambda_t$, where $\dot\lambda_t$ is the time-derivative of the log-SNR (signal-to-noise) ratio $\lambda_t = \log(\alpha_t^2/\beta_t^2)$.
Proof. Tweedie's denoising formula says that since $p_t$ is the distribution of $\alpha_t x_0$ noised by $\mathcal{N}(0, \beta_t^2 I)$, then
$$\mathbb{E}[x_0 \mid x_t = x] = \frac{x + \beta_t^2\,\nabla\log p_t(x)}{\alpha_t}.$$
Similarly,
$$\mathbb{E}[\varepsilon \mid x_t = x] = \frac{x - \alpha_t\,\mathbb{E}[x_0 \mid x_t = x]}{\beta_t} = -\beta_t\,\nabla\log p_t(x).$$
Gathering the two in $v_t(x) = \dot\alpha_t\,\mathbb{E}[x_0 \mid x_t = x] + \dot\beta_t\,\mathbb{E}[\varepsilon \mid x_t = x]$ yields the formula.
The formula above tells us (once again!) that, given the choice of $\alpha_t$ and $\beta_t$, the only thing that matters is the knowledge of the score $\nabla\log p_t$. How this score was learnt is actually a secondary problem: we don't care if the learning was done in a diffusion framework or elsewhere. When we have it, we can plug it into the velocity formula and sample.
By carefully looking at the derivations, we can see that there are one-to-one linear connections between
$\nabla\log p_t(x)$ (the score),
$\mathbb{E}[\varepsilon \mid x_t = x]$ (the denoising model),
$\mathbb{E}[x_0 \mid x_t = x]$ (the data prediction model),
$v_t(x)$ (the velocity model).
Only one of them needs to be learned, and we can convert it into the three others. If one wants to use a pretrained model for sampling, one thus needs to keep track of how the model was learned: is it a score, a denoiser, a data predictor or a velocity?
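As a sketch of these conversions, here is a minimal sanity check in one dimension with a Gaussian $\pi_0$, where the score is available in closed form. The function names and the test values are made up for this example.

```python
import numpy as np

# The four quantities are linear functions of each other once alpha_t (a)
# and beta_t (b) are fixed, for x_t = a*x0 + b*eps.

def score_to_eps(score, b):            # E[eps | x_t] = -b * score
    return -b * score

def score_to_x0(x, score, a, b):       # Tweedie: E[x0 | x_t] = (x + b^2 score)/a
    return (x + b**2 * score) / a

def score_to_velocity(x, score, a, b, da, db):
    # v_t(x) = da * E[x0 | x_t] + db * E[eps | x_t]
    return da * score_to_x0(x, score, a, b) + db * score_to_eps(score, b)

# Closed-form check: pi_0 = N(0, s^2) gives p_t = N(0, a^2 s^2 + b^2),
# whose score is -x / var, and whose velocity is (sigma_t'/sigma_t) x.
s, t, x = 2.0, 0.3, 0.7
a, b, da, db = 1.0 - t, t, -1.0, 1.0
var = a**2 * s**2 + b**2
score = -x / var

v = score_to_velocity(x, score, a, b, da, db)
v_direct = (a * da * s**2 + b * db) / var * x
print(v, v_direct)   # identical up to rounding
```

Going through the score is one possible pivot; one could equally pivot through the data prediction or the noise prediction, since all four maps are invertible linear relations (away from the endpoints where $\alpha_t$ or $\beta_t$ vanishes).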
We close this part with an important note: given a score model $\nabla\log p_t$, regardless of how it is formulated and trained, using it to sample from the flow ODE or to sample from the DDIM ODE in diffusion models is exactly the same.
There is room for the choice of the connection $\varphi$; among these choices, some should be better than others. The choice of one of these connections provides a velocity field $v$ transporting samples from $\pi_0$ to samples from $\pi_1$. Consider all the possible fields $w$ such that the solutions of the ODE $\dot{y}_t = w_t(y_t)$ started at $y_0 \sim \pi_0$ end at $y_1 \sim \pi_1$. The best possible one should minimize the total kinetic energy,
$$\int_0^1 \mathbb{E}\big[|w_t(y_t)|^2\big]\,dt,$$
where the expectation is over the trajectories $y_t$. This problem is widely studied and can be solved: it is quite intuitive that the optimal trajectories are straight lines, so the optimal velocity is constant along each trajectory, and indeed the optimal flows are given by
$$y_t = (1 - t)\,y_0 + t\,\nabla u(y_0),$$
where $u$ is Brenier's potential, the unique (under some conditions) convex function such that $\nabla u(y_0)$ has distribution $\pi_1$ when $y_0 \sim \pi_0$, and which minimizes the squared transport cost $\mathbb{E}[|y_0 - \nabla u(y_0)|^2]$. Of course, computing $u$ is in general intractable.
This solves the unconditional OT problem, but we need conditional flows. One nice trick goes as follows: using Jensen's inequality and de-conditioning, we can write
$$\int_0^1 \mathbb{E}\big[|v_t(x_t)|^2\big]\,dt = \int_0^1 \mathbb{E}\big[\big|\mathbb{E}[\partial_t\varphi_t(x_0, x_1) \mid x_t]\big|^2\big]\,dt \leqslant \int_0^1 \mathbb{E}\big[|\partial_t\varphi_t(x_0, x_1)|^2\big]\,dt.$$
This last bound is the expected kinetic energy of the conditional flow over the coupling $\pi$. For each realization $(x_0, x_1)$, we can try to find the path $t \mapsto \varphi_t(x_0, x_1)$ which minimizes $\int_0^1 |\partial_t\varphi_t(x_0, x_1)|^2\,dt$ subject to the boundary conditions $\varphi_0(x_0, x_1) = x_0$ and $\varphi_1(x_0, x_1) = x_1$. The solution is obviously the straight line $\varphi_t(x_0, x_1) = (1 - t)\,x_0 + t\,x_1$ (this can be found formally using the Euler–Lagrange conditions).
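For completeness, the Euler–Lagrange computation is short; here it is as a small derivation (standard calculus of variations, nothing specific to flow matching):

```latex
% Minimizing the conditional kinetic energy for a fixed pair (x_0, x_1):
\min_{\varphi}\ \int_0^1 |\dot\varphi_t|^2 \, dt,
\qquad \varphi_0 = x_0, \quad \varphi_1 = x_1.
% The Lagrangian L(\varphi, \dot\varphi) = |\dot\varphi|^2 does not
% depend on \varphi, so the Euler--Lagrange equation reads
\frac{d}{dt}\,\frac{\partial L}{\partial \dot\varphi}
  = \frac{\partial L}{\partial \varphi} = 0
\;\Longleftrightarrow\; \ddot\varphi_t = 0
\;\Longleftrightarrow\; \varphi_t = (1 - t)\,x_0 + t\,x_1,
% where the last step uses the two boundary conditions.
```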
The conclusion of these considerations is as follows.
The linear interpolant minimizes the conditional kinetic energy; it might not be a minimizer of the kinetic energy itself, but it is a good starting point. In any case, it is important to keep in mind that flowing straight between two points is absolutely not the same as flowing straight between two distributions.
Up to now, we've seen how flow matching reformulates and simplifies diffusion sampling. There remains a problem: in practice, to sample from these models, we need to solve an ODE or an SDE, which in practice is done by discretizing time and using a scheme like Euler or Runge–Kutta. But every time step needs a feedforward evaluation of a neural network. Most schemes need around 50 time steps, which is why sampling from diffusions can be slow.
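To illustrate this cost structure, here is a minimal Euler sampler in one dimension. The example is self-contained because for a Gaussian $\pi_0$ the exact velocity is available in closed form, so a cheap function evaluation stands in for the network feedforward; all names and constants are local to this sketch.

```python
import numpy as np

# Euler discretization of the sampling ODE: start from noise at t = 1 and
# integrate backward to t = 0. With pi_0 = N(0, s^2) the exact velocity is
# known, and the final samples should have std close to s.

rng = np.random.default_rng(0)
s = 2.0
n_samples, n_steps = 20_000, 200

def velocity(x, t):
    # Exact v_t(x) = (sigma_t'/sigma_t) x for alpha_t = 1-t, beta_t = t;
    # in a real sampler this line is a network feedforward.
    a, b, da, db = 1.0 - t, t, -1.0, 1.0
    var = a**2 * s**2 + b**2
    return (a * da * s**2 + b * db) / var * x

x = rng.normal(0.0, 1.0, n_samples)   # x_1 ~ pi_1 = N(0, 1)
dt = 1.0 / n_steps
for k in range(n_steps):              # one velocity evaluation per step
    t = 1.0 - k * dt
    x = x - dt * velocity(x, t)       # explicit Euler, backward in time

print(x.mean(), x.std())   # approximately 0 and s = 2
```

The number of velocity evaluations is exactly `n_steps` per batch, which is the quantity consistency models (below) try to drive down to one.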
Consistency models try to directly learn the flow (the solution map of the ODE) and not the velocity – and yes, the naming should have been « velocity matching » from the beginning.
The three seminal papers that found (independently) the Flow Matching formulation of generative diffusion models are
The original Flow Matching paper by Lipman et al.
The probability-flow formulation and its rewriting as Stochastic Interpolants, by Albergo, Boffi and Vanden-Eijnden.
The Rectified Flow paper by Liu, Gong and Liu.
I find the stochastic interpolant formulation way cleaner and nicer than Lipman's one.
Since then, there have been many surveys on the topic. I pretty much like META's excellent survey of Flow Matching by Lipman and coauthors, for its depth and variety. There is also this nice blog post on the topic by Mathurin Massias and others, and this very recent one by people at DeepMind, clarifying the link between diffusions and FM.
These excellent slides by Brandon Amos present a history of the topic and how we evolved from diffusions to flows through neural ODEs and normalizing flows.
Training FM « at scale », by the Stability team, who to my knowledge were the first ones to scale FM training. They later produced the FLUX family of models.
Brenier's theorem, a little bit mathy.