Let us summarize what we've seen. Given two densities $p_0$ and $p_1$, we found a way to define an ODE $\dot{x}_t = v_t(x_t)$ such that $x_0 \sim p_0$ and $x_1 \sim p_1$. Score-matching techniques allow us to learn $v_t$ from samples. Once this is done, we can sample from $p_1$ by solving the ODE. Alternatively, there is an SDE whose marginals are also $p_t$ and whose drift is related to $v_t$, so we could also sample from $p_1$ by simulating this SDE.
In practice, we can't solve the ODE or SDE exactly, so we discretize it using, say, $N$ time steps $0 = t_0 < t_1 < \dots < t_N = 1$, with schemes like Runge-Kutta for the ODE or Euler-Maruyama for the SDE. But then, we need $N$ evaluations of the neural network that learnt $v_t$, which considerably slows down sampling. As a result, big models like FLUX.1 can take a few tens of seconds (on a 40GB GPU) to sample a single image.
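To make the cost concrete, here is a minimal sketch of Euler discretization, with a toy closed-form field $v_t(x) = -x$ standing in for the learnt network (all names here are illustrative, not from any library):

```python
import numpy as np

def velocity(t, x):
    # Stand-in for the learnt velocity field v_t(x); here a toy
    # linear field whose ODE contracts towards the origin.
    return -x

def euler_sample(x0, n_steps=100):
    """Integrate dx/dt = v_t(x) from t=0 to t=1 with the explicit
    Euler scheme: this costs n_steps evaluations of the velocity."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(k * dt, x)
    return x

x1 = euler_sample(np.array([1.0]), n_steps=1000)
# For dx/dt = -x started at 1, the exact solution at t=1 is exp(-1)
```

With a real model, each call to `velocity` is a full network forward pass, hence the direct speed/accuracy trade-off in `n_steps`.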
The ODE
$$\dot{x}_t = v_t(x_t) \tag{1}$$
started at $x_0 \sim p_0$ has density at time $t$ noted $p_t$. This probability path solves the continuity equation
$$\partial_t p_t + \nabla \cdot (p_t v_t) = 0.$$
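The continuity equation can be checked numerically on the toy ODE $\dot{x}_t = -x_t$ started at $x_0 \sim \mathcal{N}(0,1)$, for which the path is available in closed form: $x_t = x_0 e^{-t}$, so $p_t = \mathcal{N}(0, e^{-2t})$. Below is a hand-rolled finite-difference sketch (not from any library):

```python
import numpy as np

def p(t, x):
    # Density of x_t when x_0 ~ N(0,1) and dx/dt = -x:
    # x_t = x_0 * exp(-t), hence x_t ~ N(0, exp(-2t)).
    var = np.exp(-2 * t)
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def continuity_residual(t, x, eps=1e-5):
    """Finite-difference check of  d/dt p_t + d/dx (p_t * v_t) = 0
    for v_t(x) = -x; should be ~0 at any (t, x)."""
    dp_dt = (p(t + eps, x) - p(t - eps, x)) / (2 * eps)
    flux = lambda y: p(t, y) * (-y)            # p_t(x) * v_t(x)
    dflux_dx = (flux(x + eps) - flux(x - eps)) / (2 * eps)
    return dp_dt + dflux_dx
```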
There is a unique map, called the flow of the ODE and noted $\varphi_{s,t}$, such that for any two time steps $s \leqslant t$ one has $x_t = \varphi_{s,t}(x_s)$.
This map is the unique solution of the family of ODEs
$$\partial_t \varphi_{s,t}(x) = v_t(\varphi_{s,t}(x))$$
started at time $s$ at any point $x$: $\varphi_{s,s}(x) = x$. It satisfies the consistency condition $\varphi_{u,t} \circ \varphi_{s,u} = \varphi_{s,t}$ for any $s \leqslant u \leqslant t$.
In short, $t \mapsto \varphi_{s,t}(x)$ solves the ODE (1) with initial condition $x$ at time $s$. The trajectory started at $x_0$ can thus be expressed as $x_t = \varphi_{0,t}(x_0)$. In particular the flow can be used to travel back or forth along the trajectories, since $\varphi_{t,s} = \varphi_{s,t}^{-1}$.
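Both identities are easy to verify on the toy field $v_t(x) = -x$, whose flow is available in closed form, $\varphi_{s,t}(x) = x\,e^{-(t-s)}$:

```python
import numpy as np

def phi(s, t, x):
    # Closed-form flow map of the toy ODE dx/dt = -x:
    # phi_{s,t}(x) = x * exp(-(t - s)).
    return x * np.exp(-(t - s))

x = 2.0
# Consistency: composing 0 -> 0.5 then 0.5 -> 1 equals going 0 -> 1 directly.
assert np.isclose(phi(0.5, 1.0, phi(0.0, 0.5, x)), phi(0.0, 1.0, x))
# Inverse: phi_{t,s} travels back along the trajectory and undoes phi_{s,t}.
assert np.isclose(phi(1.0, 0.0, phi(0.0, 1.0, x)), x)
```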
It turns out that the flow map can be learnt directly from the knowledge of the velocity field $v_t$.
The flow map is the minimizer of the square loss
$$\mathcal{L}(\psi) = \int_0^1 \int_s^1 \mathbb{E}\left[\left|\partial_t \psi_{s,t}(x_s) - v_t(\psi_{s,t}(x_s))\right|^2\right] \, dt \, ds, \qquad x_s \sim p_s,$$
among all maps $\psi$ that are differentiable in $t$ and satisfy $\psi_{s,s}(x) = x$.
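To see why this characterizes the flow, here is a small Monte-Carlo sketch of such a loss on the toy field $v_t(x) = -x$ (the time derivative is taken by finite differences, and all names are illustrative): the exact flow drives the loss to zero, while another map satisfying $\psi_{s,s}(x) = x$ does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def v(t, x):
    return -x                        # toy velocity field

def flow_true(s, t, x):
    return x * np.exp(-(t - s))      # exact flow of dx/dt = -x

def flow_wrong(s, t, x):
    return x * (1.0 - (t - s))       # satisfies psi_{s,s}(x) = x, but wrong

def loss(psi, n_mc=2000, eps=1e-4):
    """Monte-Carlo estimate of E |d/dt psi_{s,t}(x) - v_t(psi_{s,t}(x))|^2
    over random s < t and x ~ N(0, 1)."""
    s = rng.uniform(0.0, 1.0, n_mc)
    t = s + rng.uniform(0.0, 1.0, n_mc) * (1.0 - s)
    x = rng.standard_normal(n_mc)
    dpsi_dt = (psi(s, t + eps, x) - psi(s, t - eps, x)) / (2 * eps)
    residual = dpsi_dt - v(t, psi(s, t, x))
    return np.mean(residual ** 2)

# loss(flow_true) is ~0, loss(flow_wrong) is clearly positive.
```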
Now, suppose that we have at our disposal a diffusion model or a flow matching model $v_t$. By parametrizing a family of functions $\psi^\theta_{s,t}$ with a neural network, we can minimize the loss $\mathcal{L}(\psi^\theta)$. Since $v_t$ is already a neural network, this is a kind of distillation, where we train a neural network to mimic the behavior of another neural network. In the end, we obtain a proxy $\psi^\theta$ of the flow map $\varphi$.
The learnt flow map $\psi^\theta$ can be used in various ways to sample from $p_1$.
For super-fast sampling, we can directly sample $x_0 \sim p_0$ and apply $\psi^\theta_{0,1}$ to get a sample from $p_1$. Indeed, the first « flow map learning » method directly learnt $\varphi_{t,1}$ for all $t$ and was called consistency distillation. However, this one-step map can be inaccurate: it is quite intuitive that learning how to go from $x_s$ to $x_t$ over a short time interval is easier than jumping directly from $x_0$ to $x_1$.
One can trade speed for accuracy by using the full flow map for a few time steps: typically, $x_1 \approx \psi^\theta_{t_2,1} \circ \psi^\theta_{t_1,t_2} \circ \psi^\theta_{0,t_1}(x_0)$ with $0 < t_1 < t_2 < 1$. This needs two more forward passes, but the resulting sample is probably closer to the real solution of (1) started at $x_0$.
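The effect is easy to illustrate with a deliberately crude stand-in for the learnt map on the toy ODE $\dot{x}_t = -x_t$: the first-order map $\hat\psi_{s,t}(x) = x\,(1-(t-s))$, which is exact at $t = s$ but degrades over long intervals (all names here are illustrative):

```python
import numpy as np

def psi_hat(s, t, x):
    # Stand-in for an imperfect learnt flow map of dx/dt = -x:
    # accurate for nearby s and t, poor across the whole interval.
    return x * (1.0 - (t - s))

def sample(x0, steps):
    """Compose psi_hat along the given time grid, e.g. [0, 1]
    (one pass) or [0, 1/3, 2/3, 1] (three passes)."""
    x = x0
    for s, t in zip(steps[:-1], steps[1:]):
        x = psi_hat(s, t, x)
    return x

x0 = 1.0
exact = x0 * np.exp(-1.0)                        # true solution at t = 1
one_step = sample(x0, [0.0, 1.0])                # 1 forward pass
three_steps = sample(x0, [0.0, 1/3, 2/3, 1.0])   # 3 forward passes
# three_steps lands closer to exact than one_step does.
```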
Although present earlier in the literature, the first paper to systematically distill diffusion models was the Consistency Models paper. The Flow Map Matching paper interprets and generalizes it within the stochastic interpolant framework, which is the one I follow here. Very recently, the Inductive Moment Matching technique was designed to be even more efficient.