Flow models V: consistency, flow maps and distillation

March 2025

Let us summarize what we've seen. Given two densities $p_0, p_1$, we found a way to define an ODE $\dot{X}_t = v_t(X_t)$ such that $X_0 \sim p_0$ and $X_1 \sim p_1$. Score-matching techniques allow us to learn $v_t$ from samples. Once this is done, we can sample from $p_1$ by solving the ODE. Alternatively, there is an SDE whose marginals are also $p_t$ and whose drift is related to $v_t$, so we could also sample by simulating this SDE.

In practice, we can't solve the ODE or SDE exactly, so we discretize it using, say, $N$ time steps $t_1 < \dotsb < t_N$, with schemes like Runge-Kutta for the ODE, or Euler-Maruyama for the SDE. But then we need $N$ evaluations of the neural network that learnt $v_t$, which considerably slows down sampling. As a result, big models like FLUX.1 can take a few tens of seconds (on a 40GB GPU) to sample a single image.
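To make the cost concrete, here is a minimal sketch of $N$-step Euler sampling for a toy velocity field. The field $v_t(x) = -x$ is an illustrative choice, not a trained model; in practice $v$ would be the neural network, and each of the $N$ steps would cost one forward pass.

```python
import numpy as np

def euler_sample(v, x0, N):
    """Integrate dx/dt = v(t, x) from t=0 to t=1 with N Euler steps.

    Each step costs one evaluation of v, so the sampling cost grows
    linearly with N (in practice v is the expensive neural network;
    here it is just a function handle).
    """
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / N
    for k in range(N):
        t = k * dt
        x = x + dt * v(t, x)
    return x

# Toy velocity field v_t(x) = -x, whose exact solution is x0 * exp(-t).
x1 = euler_sample(lambda t, x: -x, x0=1.0, N=1000)
print(x1)  # close to exp(-1) ≈ 0.3679
```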

The ODE $\dot{X}_t = v_t(X_t)$ started at $x_0$ produces a flow $\Psi_t(x_0)$, which is the solution of the ODE at time $t$ starting at $x_0$. Is it possible to directly learn this flow map $\Psi_t$ with a single neural network?

Learning the flow map

The ODE

$$\dot{x}_t = v_t(x_t) \tag{1}$$

started at $x_0 \sim p_0$ has density at time $t$ denoted $p_t$. This probability path solves the continuity equation

$$\partial_t p_t = - \nabla \cdot (v_t p_t). \tag{2}$$
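As a sanity check, for the illustrative field $v_t(x) = -x$ started from $p_0 = \mathcal{N}(0,1)$, the trajectories are $x_t = x_0 e^{-t}$, so the path is $p_t = \mathcal{N}(0, e^{-2t})$, and the continuity equation can be verified numerically with finite differences:

```python
import numpy as np

# Probability path of the toy ODE x' = -x started from N(0, 1):
# p_t = N(0, e^{-2t}).
def p(t, x):
    var = np.exp(-2 * t)
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

t, x, h = 0.3, 0.7, 1e-5
# Left-hand side of the continuity equation: d/dt p_t(x).
lhs = (p(t + h, x) - p(t - h, x)) / (2 * h)
# Right-hand side: -d/dx (v_t p_t) with v_t(x) = -x.
rhs = -((-(x + h)) * p(t, x + h) - (-(x - h)) * p(t, x - h)) / (2 * h)
print(abs(lhs - rhs))  # small: both sides agree up to discretization error
```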

There is a unique map, called the flow of the ODE and denoted $\Psi_{s,t}:\mathbb{R}^d \to \mathbb{R}^d$, such that for any two time steps $s,t$ one has

$$\Psi_{s,t}(x_s) = x_t. \tag{3}$$

This map is the unique solution of the family of ODEs started at time $s$ at any point $x$:

$$\partial_t \Psi_{s,t}(x) = v_t(\Psi_{s,t}(x)). \tag{4}$$

It satisfies the consistency condition $\Psi_{s,t}\circ \Psi_{r,s} = \Psi_{r,t}$ for any $r,s,t$.

Proof. Taking the time-$t$ derivative of the identity $\Psi_{s,t}(x_s) = x_t$ gives $\partial_t \Psi_{s,t}(x_s) = v_t(\Psi_{s,t}(x_s))$, which is (4). This is the ODE (1) with initial condition $x_s = \Psi_{s,s}(x_s)$. The uniqueness of solutions of ODEs gives the result. The consistency equation comes from the simple fact that $\Psi_{s,t}(\Psi_{r,s}(x_r)) = \Psi_{s,t}(x_s) = x_t = \Psi_{r,t}(x_r)$. Solving the ODE started at time $r$ at any point $x_r = x$ gives the general identity $\Psi_{s,t}(\Psi_{r,s}(x)) = \Psi_{r,t}(x)$.

In short, $\Psi_{s,t}$ solves the ODE (1) with initial condition $x_s$. The trajectory started at $x_0$ can thus be expressed as $x_s = \Psi_{0,s}(x_0)$. In particular, the flow map can be used to travel back or forth along the trajectories, since $\Psi_{t,s} = \Psi_{s,t}^{-1}$.
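For the toy field $v_t(x) = -x$ the flow map is available in closed form, $\Psi_{s,t}(x) = x e^{-(t-s)}$, which lets us check the consistency and invertibility properties directly (this closed form is specific to the linear toy example):

```python
import numpy as np

# Closed-form flow map of the linear ODE x' = -x:
# Psi_{s,t}(x) = x * exp(-(t - s)).
def Psi(s, t, x):
    return x * np.exp(-(t - s))

x = 2.0
r, s, t = 0.1, 0.4, 0.9

# Consistency: composing r -> s then s -> t equals going r -> t directly.
assert np.isclose(Psi(s, t, Psi(r, s, x)), Psi(r, t, x))

# Invertibility: Psi_{t,s} undoes Psi_{s,t}.
assert np.isclose(Psi(t, s, Psi(s, t, x)), x)
```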

It turns out that the flow map can be learnt directly from the knowledge of the velocity field vtv_t.

The flow map $\Psi_{s,t}$ is the minimizer of the square loss

$$L(S) = \int_0^1 \int_0^1 \int_{\mathbb{R}^d} | \partial_t S_{s,t}(x) - v_t(S_{s,t}(x))|^2 \, p_s(x) \, dx \, ds \, dt$$

among all maps $S_{s,t}:\mathbb{R}^d \to \mathbb{R}^d$ that are differentiable in $t$ and satisfy $S_{s,s}(x) = x$.

Proof. It is clear that if $p_t(x)>0$ everywhere (which we implicitly assume), then the loss above is minimized (and equal to 0) only when $\partial_t S_{s,t} = v_t(S_{s,t})$, that is, when $S_{s,t}$ satisfies equation (4). Uniqueness in the preceding theorem implies that the minimizer is $\Psi_{s,t}$.

Now, suppose that we have at our disposal a diffusion model or a flow matching model. By parametrizing a family of functions $S_{s,t}$ with a neural network, we can minimize the loss $L(S)$. Since $v_t$ is itself a neural network, this is a kind of distillation: we train one neural network to mimic the behavior of another. In the end, we obtain a proxy $S_{s,t}$ for the flow map $\Psi_{s,t}$.
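As an illustration of minimizing $L(S)$, here is a toy distillation where the "network" is a hypothetical one-parameter family $S_{s,t}(x) = x e^{\theta(t-s)}$ (chosen so that $S_{s,s}(x) = x$ holds for every $\theta$) and the teacher field is $v_t(x) = -x$. A real implementation would replace the grid search with stochastic gradient descent on network weights, and compute $\partial_t S$ by automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher velocity field v_t(x) = -x; its true flow is Psi_{s,t}(x) = x e^{-(t-s)}.
def v(t, x):
    return -x

# Hypothetical student family S_{s,t}(x) = x * exp(theta * (t - s)),
# satisfying the constraint S_{s,s}(x) = x for every theta.
def S(theta, s, t, x):
    return x * np.exp(theta * (t - s))

def dS_dt(theta, s, t, x):  # exact time derivative of S
    return theta * S(theta, s, t, x)

def loss(theta, n=4096):
    # Monte Carlo estimate of L(S): sample s, t uniformly and x from a
    # stand-in for p_s, then average the squared residual of equation (4).
    s = rng.uniform(0, 1, n)
    t = rng.uniform(0, 1, n)
    x = rng.normal(size=n)
    resid = dS_dt(theta, s, t, x) - v(t, S(theta, s, t, x))
    return np.mean(resid**2)

thetas = np.linspace(-2, 0, 201)
best = thetas[np.argmin([loss(th) for th in thetas])]
print(best)  # close to -1: the grid search recovers the true flow map
```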

Few-steps sampling

The learnt flow map $S$ can be used in various ways to sample from $p_1$.
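For instance, one can jump from time 0 to time 1 in a single evaluation $S_{0,1}(x_0)$, or take a few composed steps; the consistency property guarantees the two agree. A sketch with a toy closed-form stand-in for the learnt map (the flow of $\dot{x} = -x$, not an actual trained model):

```python
import numpy as np

# Toy closed-form stand-in for a learnt flow map: the flow of x' = -x.
def S(s, t, x):
    return x * np.exp(-(t - s))

x0 = np.array([1.0, -0.5, 2.0])

# One-step sampling: a single network evaluation from time 0 to time 1.
x1_one = S(0.0, 1.0, x0)

# Two-step sampling: 0 -> 0.5 -> 1; by consistency this agrees.
x1_two = S(0.5, 1.0, S(0.0, 0.5, x0))

print(np.allclose(x1_one, x1_two))  # True
```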

References

Although present earlier in the literature, the first paper to systematically distill diffusion models was the Consistency models paper. The Flow Map Matching paper interprets and generalizes it in the stochastic interpolant framework, which is the one I follow here. Very recently, the Inductive Moment Matching technique was designed to be even more efficient.