Flow Models IV: What is Classifier-Free Guidance?

March 2025

Generative models are often presented as unconditional models, which means that they are trained to generate samples from a distribution $p$ on, say, $\mathbb{R}^d$.

However, in practice it is of paramount importance to sample from conditional distributions: we do not want to generate images out of the blue, but rather images fitting a description (called a prompt), like an image of a dog wearing a hat. Formally, there is an underlying joint distribution $p(x, c)$ over pairs where $x$ is a sample (image, text, sound, video) and $c$ is some conditioning information: a text description, a visual shape, a color palette, and so on. Our goal is to learn to sample from $p(x \mid c)$, the distribution of $x$ conditioned on $c$. This is called « guidance »; it has been investigated since the beginning of generative models.

The first papers on diffusion models handled this with a method called classifier guidance, but nowadays it is almost always done using the classifier-free guidance (CFG) technique from Ho and Salimans's paper, a crucial step in the development of generative models.

CFG has empirically been shown to yield very good results, at least far better than the preceding approaches. However, it remains essentially a trick, and in my opinion its theoretical understanding remains shaky.

Diffusions redux

The noising path will be denoted $p_t$, with $p_0$ the distribution we want to sample and $p_T \approx N(0, I_d)$ the easy-to-sample distribution. The reverse path $q_t = p_{T-t}$ can be represented as the probability path of an SDE (DDPM sampling) or an ODE (DDIM sampling), both of which need knowledge of the gradient of the log-density of $p_t$ at each time $t$, namely $\nabla_x \ln p_t(\cdot)$. This score is learnt by means of denoising score-matching and approximated by a neural network, say $s_t(\cdot)$.
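To make this concrete, here is a minimal numerical sanity check (my own toy example, not from any paper): in one dimension with $p_0 = N(\mu, s^2)$, the score of $p_t$ under VP noising is available in closed form, and integrating the probability-flow ODE backwards from $p_T \approx N(0,1)$ indeed lands on $p_0$.

```python
import numpy as np

# Toy check of DDIM-style sampling (reverse-time probability-flow ODE) in 1-D,
# with p_0 = N(mu, s2) so that the exact score of p_t is known in closed form.
# VP noising: x_t = a_t x_0 + sigma_t eps, a_t^2 = exp(-t), sigma_t^2 = 1 - a_t^2.
mu, s2 = 2.0, 0.25

def score(x, t):   # grad_x ln p_t(x), with p_t = N(a_t mu, a_t^2 s2 + sigma_t^2)
    a2 = np.exp(-t)
    return -(x - np.sqrt(a2) * mu) / (a2 * s2 + 1.0 - a2)

T, n_steps, n = 8.0, 4000, 20_000
dt = T / n_steps
rng = np.random.default_rng(0)
x = rng.standard_normal(n)             # start from p_T ≈ N(0, 1)
for k in range(n_steps):               # Euler steps backward in time
    t = T - k * dt
    x += dt * (0.5 * x + 0.5 * score(x, t))
print(x.mean(), x.var())               # ≈ mu and s2
```

In real models the closed-form `score` is of course replaced by the learnt network $s_t$.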

Guidance

We now consider joint distributions of the form $p(x, c)$. During the noising process, we only inject noise in the sample $x$ and keep $c$ fixed; we write $p_t(x, c)$ for the joint distribution of $x$ and $c$ along the noising path. The unconditional marginal distribution of $x$ is

$$p_t(x) = \int p_t(x, c) \, dc.$$

Bayes' formula says that

$$p_t(x \mid c) = \frac{p_t(c \mid x) \, p_t(x)}{p_t(c)}.$$

The gradient in $x$ of the log of this conditional distribution is therefore (the term $p_t(c)$ does not depend on $x$ and drops out)

$$\nabla_x \ln p_t(x \mid c) = \nabla_x \ln p_t(c \mid x) + \nabla_x \ln p_t(x).$$
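This identity is easy to check numerically. Here is a sketch in a toy two-class Gaussian mixture (the mixture parameters and the finite-difference check are my own choices):

```python
import numpy as np

# Numerical check of: grad_x ln p(x|c) = grad_x ln p(c|x) + grad_x ln p(x)
# in a toy mixture: c in {0, 1} with prior (0.3, 0.7), p(x|c) = N(mu_c, 1).
mu, prior = np.array([-1.0, 2.0]), np.array([0.3, 0.7])

def log_gauss(x, m):        # log N(x; m, 1)
    return -0.5 * (x - m) ** 2 - 0.5 * np.log(2 * np.pi)

def log_p_x(x):             # log of the marginal p(x) = sum_c prior_c N(x; mu_c, 1)
    return np.log(np.sum(prior * np.exp(log_gauss(x, mu))))

def log_p_c_given_x(x, c):  # Bayes: log p(c|x)
    return np.log(prior[c]) + log_gauss(x, mu[c]) - log_p_x(x)

def grad(f, x, h=1e-5):     # central finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

x, c = 0.7, 1
lhs = grad(lambda y: log_gauss(y, mu[c]), x)                    # grad ln p(x|c)
rhs = grad(lambda y: log_p_c_given_x(y, c), x) + grad(log_p_x, x)
print(lhs, rhs)             # the two sides agree
```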

The second term has already been learnt and approximated by $s_t$; if we want to sample from $p_t(\cdot \mid c)$, we thus need access to the first term.

Classifiers

Indeed, this first term can be seen as the gradient of a classifier: $p_t(c \mid x)$ is precisely the optimal (probabilistic) classifier of a sample $x$. Consequently, if we have at our disposal a pretrained classifier of $x$, we can use it to guide the generation of $x$ along the noising path: simply replace the neural network $s_t$ by the « guided » version

$$s_t(x) + \nabla_x \ln p_t(c \mid x).$$

This technique is called classifier guidance. Its main advantage is that once we have learnt the unconditional score $\nabla \ln p_t$, we can adapt it to conditional sampling using any classifier (and there are plenty of open-source classifiers available for many problems).

Scaled guidance

Practitioners noted that using the guided score above as it is could be impractical, and that rescaling the classifier term by a factor $\gamma$ could improve quality and diversity:

$$s_t(x) + \gamma \nabla_x \ln p_t(c \mid x).$$

A large $\gamma$ (say, larger than 1) « strengthens » the influence of the conditioning $c$ along the generation. The intuition is that the above function is the score of a distribution proportional to

$$p_t(x) \, p_t(c \mid x)^{\gamma}.$$

The parameter $\gamma$ is akin to an inverse temperature in statistical physics; increasing it « peaks » $p_t(c \mid x)$ around its modes, thus promoting adherence to the condition $c$.
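This sharpening effect is easy to visualize on a made-up classifier posterior (the numbers below are purely illustrative): raising it to the power $\gamma$ and renormalizing concentrates the mass on the mode.

```python
import numpy as np

# A made-up classifier posterior over 3 classes, raised to the power gamma
# and renormalized: large gamma concentrates the mass on the argmax class.
p = np.array([0.5, 0.3, 0.2])

def sharpen(p, gamma):
    q = p ** gamma
    return q / q.sum()

print(sharpen(p, 1.0))   # unchanged
print(sharpen(p, 5.0))   # most of the mass sits on the first class
```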

Limitations

Unfortunately, this strategy requires a good classifier $p_t(c \mid x)$ that works even for large $t$, when the sample $x$ is extremely noisy. It can be extremely hard to extract classifying information from an image so noisy that it is almost Gaussian: this is why plugging in a pre-trained classifier is generally a bad idea. We would need to re-train a classifier explicitly on noisy data, with noise at different scales.

Classifier-free guidance

Classifier-free guidance takes a step back from this problem: it also learns the conditional distributions $p_t(x \mid c)$ during training, instead of just $p_t(x)$.

The whole training process needs to be adapted. The neural network now takes two inputs: $s_t(x, c)$ approximates the score of $p_t(x \mid c)$. The space where $c$ lives is extended with a dummy element $\varnothing$, so that $s_t(x, \varnothing)$ approximates the unconditional score $\nabla \ln p_t(x)$. In practice, this is done by choosing an « unconditional training proportion », generally 10%: during training, 10% of the samples are assigned the dummy label $\varnothing$, and the rest keep their true conditioning information $c$. This trick lets us learn the conditional and the unconditional distributions at the same time.
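A minimal sketch of this label-dropout trick (toy data, numpy only; the sentinel value standing in for $\varnothing$ and the toy data are my own choices):

```python
import numpy as np

# Sketch of the CFG training trick: with probability p_uncond = 10%, the
# conditioning label is replaced by a dummy token NULL, so a single network
# can learn both s_t(x, c) and s_t(x, NULL) ≈ the unconditional score.
NULL = -1                  # dummy label standing in for the symbol "∅"
p_uncond = 0.10
rng = np.random.default_rng(0)

def training_batch(n):
    c = rng.integers(0, 10, size=n)      # true labels
    x = c + rng.standard_normal(n)       # toy samples tied to the labels
    drop = rng.random(n) < p_uncond
    c = np.where(drop, NULL, c)          # ~10% of labels become ∅
    return x, c

x, c = training_batch(100_000)
print((c == NULL).mean())                # ≈ 0.10
```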

Once this is done, Bayes' formula provides us (for free!) with a classifier, since

$$\nabla_x \ln p_t(c \mid x) = \nabla_x \ln p_t(x \mid c) - \nabla_x \ln p_t(x).$$

This is approximated by $s_t(x, c) - s_t(x, \varnothing)$. Using $\gamma$-guidance as before, the gradient used along the sampling path becomes

$$s_t(x, \varnothing) + \gamma \, (s_t(x, c) - s_t(x, \varnothing)),$$

which can be further simplified.

Classifier-Free Guidance consists in using the $\gamma$-rescaled score at sampling time:

$$(1 - \gamma) \, s_t(x, \varnothing) + \gamma \, s_t(x, c).$$

This score is an approximation of $(1-\gamma)\nabla_x \ln p_t(x) + \gamma \nabla_x \ln p_t(x \mid c)$.
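In code, the sampling-time combination is a one-liner, and its equivalence with the unsimplified "unconditional score plus correction" form above is immediate (the numbers are placeholders for network outputs):

```python
# The CFG score at sampling time, written both in the simplified form and as
# unconditional-plus-correction; the two are algebraically identical.
def cfg_score(s_uncond, s_cond, gamma):
    return (1.0 - gamma) * s_uncond + gamma * s_cond

s_u, s_c, gamma = 0.3, 1.1, 3.0              # stand-ins for s_t(x,∅), s_t(x,c)
alt = s_u + gamma * (s_c - s_u)              # s_t(x,∅) + γ (s_t(x,c) - s_t(x,∅))
print(cfg_score(s_u, s_c, gamma), alt)       # same number
print(cfg_score(s_u, s_c, 1.0))              # γ = 1 recovers the conditional score
```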

Experimentally, this technique trades variety for quality: increasing the CFG scale $\gamma$ from 0 to (say) 10 improves the perceptual quality of the conditional samples (as measured by the Inception Score, IS), but reduces their variety (as reflected in a worse Fréchet Inception Distance, FID).

The gamma-powered distribution

A common intuition found in papers is that sampling with the CFG score above amounts to sampling from the probability distribution proportional to

$$p_t(x)^{1-\gamma} \, p_t(x \mid c)^{\gamma}.$$

However, as examined in Bradley and Nakkiran's paper, this intuition is wrong. It would be true if $p_t(x)^{1-\gamma} p_t(x \mid c)^{\gamma}$ were the time-$t$ noising of the distribution $p_0(x)^{1-\gamma} p_0(x \mid c)^{\gamma}$, but that is mathematically not the case.

In addition, if this score were the score of a valid diffusion process, choosing ODE sampling or SDE sampling should not matter much, since both processes would have the same marginals. This is also mathematically wrong: it is easy to check in a Gaussian setting where $p(x \mid c) = N(c, 1)$ and $c \sim N(0, 1)$, so that we have access to the exact scores, that ODE and SDE sampling with $\gamma$-rescaling and conditioning on $c = 0$ lead to radically different distributions.
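This divergence can be reproduced in a few lines (a sketch; the discretization, horizon and $\gamma = 3$ are my own choices). In this setting the marginal is $p(x) = N(0, 2)$, so under VP noising with $\beta = 1$ the exact scores are $\nabla \ln p_t(x) = -x/(1 + a_t^2)$ and $\nabla \ln p_t(x \mid 0) = -x$, with $a_t^2 = e^{-t}$. Plugging the $\gamma$-rescaled score into Euler(-Maruyama) versions of both samplers, the ODE and the SDE end up with clearly different variances, and the would-be target $p(x)^{1-\gamma} p(x \mid 0)^{\gamma} = N(0, 1/2)$ matches neither.

```python
import numpy as np

# Gaussian example: p(x|c) = N(c, 1), c ~ N(0, 1), condition on c = 0,
# VP noising with beta = 1. Exact scores:
#   grad ln p_t(x)       = -x / (1 + a_t^2)   (marginal p_t = N(0, 1 + a_t^2))
#   grad ln p_t(x | c=0) = -x                 (conditional stays N(0, 1) for all t)
gamma, T, n_steps, n = 3.0, 6.0, 3000, 40_000
dt = T / n_steps
rng = np.random.default_rng(1)

def guided_score(x, t):
    a2 = np.exp(-t)
    return (1 - gamma) * (-x / (1 + a2)) + gamma * (-x)

x_sde = rng.standard_normal(n)   # DDPM-style reverse SDE, started from N(0, 1)
x_ode = rng.standard_normal(n)   # DDIM-style probability-flow ODE
for k in range(n_steps):
    t = T - k * dt
    s_sde, s_ode = guided_score(x_sde, t), guided_score(x_ode, t)
    x_sde += dt * (0.5 * x_sde + s_sde) + np.sqrt(dt) * rng.standard_normal(n)
    x_ode += dt * (0.5 * x_ode + 0.5 * s_ode)
print(x_ode.var(), x_sde.var())  # the two samplers land on visibly different laws
```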

At the time of writing this post (March 2025), it remains unclear (at least to me) what distribution is sampled by the CFG technique above. Something is lacking in our understanding of CFG.

How is guidance done for Flow Matching models?

We simply replace the score $\nabla_x \ln p_t$ with the velocity field.

In this case, we need to learn a « joint velocity field » $v_t(x, c)$: the entire ODE system is conditional on $c$. In practice, this means that the conditional flows/velocities are conditioned twice: once on the conditioning information $c$, and once on the final sample itself, $X_1 \sim p_1(\cdot \mid c)$. Here again, $c$ can be the placeholder $\varnothing$, meaning that the flow is unconditional. Once the approximation $u^\theta_t(x, c)$ is learnt, we can mimic the CFG strategy: at sampling time, we use the velocity

$$(1-\gamma) \, u^\theta_t(x, \varnothing) + \gamma \, u^\theta_t(x, c).$$
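As a sanity check that the same recipe applies, here is a toy 1-D flow-matching sampler with closed-form velocity fields (entirely my own illustrative construction, with a linear path $x_t = (1-t)x_0 + t x_1$ and $x_0 \sim N(0,1)$): with $\gamma = 1$ it reduces to plain conditional sampling.

```python
import numpy as np

# Toy gamma-guided flow-matching sampler. Along the linear path, a Gaussian
# target N(mean1, var1) has marginals p_t = N(t mean1, (1-t)^2 + t^2 var1),
# generated by the affine velocity u_t(x) = m'_t + v'_t / (2 v_t) (x - m_t).
# Unconditional target: N(0, 2); target conditioned on c: N(c, 1), here c = 2.
def velocity(x, t, mean1, var1):
    m, v = t * mean1, (1 - t) ** 2 + t ** 2 * var1
    return mean1 + (-2 * (1 - t) + 2 * t * var1) / (2 * v) * (x - m)

def sample(gamma, c=2.0, n=20_000, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    x, dt = rng.standard_normal(n), 1.0 / n_steps
    for k in range(n_steps):     # Euler integration of the guided ODE, t: 0 -> 1
        t = k * dt
        u = (1 - gamma) * velocity(x, t, 0.0, 2.0) + gamma * velocity(x, t, c, 1.0)
        x += dt * u
    return x

x = sample(gamma=1.0)
print(x.mean(), x.var())   # gamma = 1 is plain conditional sampling: ≈ N(2, 1)
```

With $\gamma \neq 1$ the sampler runs just the same, but, as discussed above, what distribution it actually produces is another matter.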

References

Diffusion Models Beat GANs on Image Synthesis really pushed forward classifier guidance.

Classifier-Free Diffusion Guidance was the first paper to introduce CFG.

Classifier-Free Guidance is a Predictor-Corrector, a nice recent review of CFG.

What does guidance do?, another recent paper on the topic.