Generative models are often presented as unconditional models, meaning that they are trained to generate samples from a distribution on, say, $\mathbb{R}^d$.
However, in practice, it is of paramount importance to sample from conditional distributions: we do not want to generate images out of the blue, but rather images fitting a description (called a prompt), like « an image of a dog wearing a hat ». Formally, there is an underlying joint distribution over couples $(x, c)$ where $x$ is a sample (an image, a text, a sound, a video) and $c$ is a piece of conditioning information: a text description, a visual shape, a color palette, anything. Our goal is to learn to sample from $p(x \mid c)$, the distribution of $x$ conditioned on $c$. This is called « guidance »; it has been investigated since the beginning of generative models.
The first papers on diffusion models had a method for this called classifier guidance, but nowadays it is almost always done using the classifier-free guidance (CFG) technique from Ho and Salimans's paper, a crucial step in the development of generative models.
CFG has empirically been shown to yield very good results, far better than the preceding approaches. However, it remains essentially a trick, and in my opinion its theoretical understanding is still shaky.
The noising path will be noted $(p_t)_{t \in [0, T]}$, with $p_0$ the distribution we want to sample and $p_T$ the easy-to-sample distribution (essentially a standard Gaussian). The reverse path can be represented as the probability path of an SDE (DDPM sampling) or an ODE (DDIM sampling), both of which need knowledge of the gradient of the log-density of $p_t$ at each time $t$: the score $\nabla_x \log p_t(x)$. This score is learnt by means of denoising score matching and approximated by a neural network, say $s_\theta(t, x)$.
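As a reminder (this generic form is only a sketch, following the usual SDE formulation of diffusions; the exact coefficients $f$ and $g$ depend on the chosen noise schedule), if the noising path is generated by a forward SDE $\mathrm{d}X_t = f(X_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t$, then DDPM sampling integrates the time-reversed SDE
$$ \mathrm{d}X_t = \bigl[ f(X_t, t) - g(t)^2\, \nabla_x \log p_t(X_t) \bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{W}_t, $$
while DDIM sampling integrates the probability-flow ODE
$$ \frac{\mathrm{d}X_t}{\mathrm{d}t} = f(X_t, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(X_t), $$
both run from $t = T$ down to $t = 0$, and both requiring only the score $\nabla_x \log p_t$.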
We now consider joint distributions of the form $p(x, c) = p(c)\, p(x \mid c)$. During the noising process, we only inject noise in the sample $x$ and keep $c$ fixed; we note $p_t(x_t, c)$ for the joint distribution of $x_t$ and $c$ along the noising path. The unconditional marginal distribution of $x_t$ is
$$ p_t(x_t) = \int p_t(x_t, c)\,\mathrm{d}c = \int p_t(x_t \mid c)\, p(c)\,\mathrm{d}c. \tag{1} $$
Bayes' formula says that
$$ p_t(x_t \mid c) = \frac{p_t(c \mid x_t)\, p_t(x_t)}{p(c)}. \tag{2} $$
The gradient of the log of this conditional distribution is therefore
$$ \nabla_{x_t} \log p_t(x_t \mid c) = \nabla_{x_t} \log p_t(c \mid x_t) + \nabla_{x_t} \log p_t(x_t) \tag{3} $$
(the $p(c)$ term does not depend on $x_t$ and drops out).
The second term here has already been learnt and approximated by $s_\theta(t, x_t)$; if we want to sample from $p_t(\cdot \mid c)$, we thus need access to the first term.
Indeed, this first term can be seen as the gradient of (the log of) a classifier: $p_t(c \mid x_t)$ is precisely the optimal classifier of a noisy sample $x_t$. Consequently, if we have at our disposal any pretrained classifier $q(c \mid x_t) \approx p_t(c \mid x_t)$, we can use it to guide the generation of $x$ along the noising path: to do so, simply replace the neural network $s_\theta(t, x_t)$ by the « guided » version
$$ s_\theta(t, x_t) + \nabla_{x_t} \log q(c \mid x_t). \tag{4} $$
This technique is called classifier guidance. Its main advantage is that once we have learnt the unconditional score $s_\theta$, we can adapt it to conditional sampling using any classifier (and there are lots of open-source classifiers available for many problems).
Practitioners noted that using (4) as it is could be impractical, and that rescaling the classifier term by a factor $w$ could improve sample quality:
$$ s_\theta(t, x_t) + w\, \nabla_{x_t} \log q(c \mid x_t). \tag{5} $$
Having a large $w$ (say, larger than 1) « strengthens » the influence of the conditioning $c$ along the generation. The intuition behind this is that the above function is the score of a distribution proportional to
$$ p_t(x_t)\, p_t(c \mid x_t)^{w}. \tag{6} $$
The parameter $w$ is akin to an inverse temperature in statistical physics; increasing it « peaks » the distribution (6) around the modes of $x_t \mapsto p_t(c \mid x_t)$, thus promoting adherence to the condition $c$.
Unfortunately, this strategy needs a good classifier that works even for large $t$, when the sample $x_t$ is extremely noisy. It can be extremely hard to extract classifying information from an image so noisy that it is almost a Gaussian sample: this is why plugging in an off-the-shelf pre-trained classifier is generally a bad idea. We would need to re-train a classifier explicitly on noisy data, with noise at different scales.
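To make the guided score (5) concrete, here is a minimal PyTorch-style sketch; the names `score_net` and `classifier` are placeholders for a pretrained score network $s_\theta(t, x_t)$ and a classifier $q(c \mid x_t)$ trained on noisy inputs, and are not meant as a specific implementation.

```python
import torch

def classifier_guided_score(score_net, classifier, x_t, t, c, w=1.0):
    """Classifier-guided score, eq. (5): s_theta(t, x_t) + w * grad_x log q(c | x_t).

    `score_net(t, x)` returns the score estimate; `classifier(t, x)` returns class
    logits and must have been trained on noisy samples x_t for this to work well.
    `c` holds the indices of the desired class for each element of the batch.
    """
    x_t = x_t.detach().requires_grad_(True)
    # log q(c | x_t): log-probability of the desired class, summed over the batch
    log_probs = torch.log_softmax(classifier(t, x_t), dim=-1)
    selected = log_probs.gather(-1, c.unsqueeze(-1)).sum()
    # gradient of the classifier log-probability with respect to the noisy sample
    grad_log_q = torch.autograd.grad(selected, x_t)[0]
    return score_net(t, x_t).detach() + w * grad_log_q
```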
Classifier-free guidance takes a step back from this problem: it also learns the conditional scores $\nabla_{x_t} \log p_t(x_t \mid c)$ during training, instead of just learning the unconditional $\nabla_{x_t} \log p_t(x_t)$.
The whole training process needs to be adapted. The neural network now takes the conditioning as an additional input: the approximation of $\nabla_{x_t} \log p_t(x_t \mid c)$ is $s_\theta(t, x_t, c)$. Note that the space where $c$ lives can be extended with a dummy element $\varnothing$, so that $s_\theta(t, x_t, \varnothing)$ is just the approximation of the unconditional score $\nabla_{x_t} \log p_t(x_t)$. In practice, this is done by choosing an « unconditional training proportion », generally 10%: during training, 10% of the samples are assigned the dummy label $\varnothing$, and the rest keep their true conditioning information $c$. This trick allows learning the conditional and the unconditional scores at the same time.
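In code, the trick is just a random label dropout. Here is a minimal sketch of a training step under some placeholder choices (a `score_net(t, x_t, c)` network, a `noise_schedule` returning $(\alpha_t, \sigma_t)$, flattened data of shape `(b, d)`, and a reserved embedding index for $\varnothing$); none of these names come from the post.

```python
import torch

NULL_TOKEN = 0   # embedding index reserved for the dummy label ∅ (a convention chosen here)
P_UNCOND = 0.1   # « unconditional training proportion »

def cfg_training_step(score_net, x0, c, noise_schedule):
    """One denoising score-matching step with random label dropout (the CFG training trick)."""
    b = x0.shape[0]
    # with probability 10%, replace the true label by the dummy label ∅
    drop = torch.rand(b, device=x0.device) < P_UNCOND
    c = torch.where(drop, torch.full_like(c, NULL_TOKEN), c)

    # standard denoising score matching on the noised sample x_t
    t = torch.rand(b, device=x0.device)
    alpha_t, sigma_t = noise_schedule(t)                      # placeholder schedule
    eps = torch.randn_like(x0)
    x_t = alpha_t.view(-1, 1) * x0 + sigma_t.view(-1, 1) * eps
    # the score of the Gaussian noising kernel, grad_x log p(x_t | x0) = -eps / sigma_t,
    # serves as the regression target (time-dependent weighting omitted for brevity)
    target = -eps / sigma_t.view(-1, 1)
    return ((score_net(t, x_t, c) - target) ** 2).mean()
```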
Once this is done, Bayes' formula provides us (for free!) with a classifier, since
$$ \nabla_{x_t} \log p_t(c \mid x_t) = \nabla_{x_t} \log p_t(x_t \mid c) - \nabla_{x_t} \log p_t(x_t). \tag{7} $$
This is approximated by $s_\theta(t, x_t, c) - s_\theta(t, x_t, \varnothing)$. Using $w$-guidance as before, the gradient used for the sampling path becomes
$$ s_\theta(t, x_t, \varnothing) + w\, \bigl( s_\theta(t, x_t, c) - s_\theta(t, x_t, \varnothing) \bigr), \tag{8} $$
which can be further simplified into
$$ (1 - w)\, s_\theta(t, x_t, \varnothing) + w\, s_\theta(t, x_t, c). \tag{9} $$
Classifier-Free Guidance consists in using the $w$-rescaled score at sampling:
$$ s^{w}_\theta(t, x_t, c) = (1 - w)\, s_\theta(t, x_t, \varnothing) + w\, s_\theta(t, x_t, c). \tag{10} $$
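At sampling time, (10) is just a weighted combination of two forward passes through the same network. A minimal sketch, with the same placeholder names as in the training sketch above:

```python
import torch

def cfg_score(score_net, x_t, t, c, w, null_token=0):
    """w-rescaled score of eq. (10): (1 - w) * s_theta(t, x_t, ∅) + w * s_theta(t, x_t, c)."""
    c_null = torch.full_like(c, null_token)   # dummy label ∅ for the whole batch
    s_uncond = score_net(t, x_t, c_null)      # s_theta(t, x_t, ∅)
    s_cond = score_net(t, x_t, c)             # s_theta(t, x_t, c)
    return (1.0 - w) * s_uncond + w * s_cond
```

With $w = 1$ this is the plain conditional score; $w > 1$ extrapolates away from the unconditional one. In practice the two forward passes are often batched together.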
Experimentally, this technique allows a tradeoff between quality and variety: increasing the CFG scale $w$ from 0 to (say) 10 improves the fidelity and perceptual quality of the conditional samples (as measured by the Inception Score, IS), but reduces their diversity (which degrades the Fréchet Inception Distance, FID).
A common intuition found in papers is that sampling with the rescaled score (10) amounts to sampling from the probability distribution proportional to
$$ p_0(x \mid c)^{w}\, p_0(x)^{1 - w}. \tag{11} $$
However, as examined in Bradley and Nakkiran's paper, this intuition is wrong. It would be true if $p_t(x_t \mid c)^{w}\, p_t(x_t)^{1-w}$, the distribution whose score is exactly (10), corresponded to the time-$t$ noising of the distribution (11), but that is mathematically not the case.
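Writing $k_t(x_t \mid x_0)$ for the Gaussian noising kernel, the obstruction is that noising (an integral operator) does not commute with taking geometric means: in general,
$$ \int k_t(x_t \mid x_0)\, p_0(x_0 \mid c)^{w}\, p_0(x_0)^{1-w}\, \mathrm{d}x_0 \;\ne\; \Bigl( \int k_t(x_t \mid x_0)\, p_0(x_0 \mid c)\, \mathrm{d}x_0 \Bigr)^{w} \Bigl( \int k_t(x_t \mid x_0)\, p_0(x_0)\, \mathrm{d}x_0 \Bigr)^{1-w} $$
(even after normalization), and the right-hand side is nothing but $p_t(x_t \mid c)^{w}\, p_t(x_t)^{1-w}$. So the family of distributions whose scores are given by (10) is not the noising path of (11).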
In addition, if this rescaled score were the score of a valid diffusion process, choosing ODE sampling or SDE sampling should not matter too much, since both processes would have the same marginals. This is also mathematically wrong: it is very easy to check, in a Gaussian setting where $p(c)$ and $p(x \mid c)$ are Gaussian (a setting where we have access to the exact scores $\nabla \log p_t(x_t)$ and $\nabla \log p_t(x_t \mid c)$), that the ODE and SDE samplings with $w$-rescaling and conditioning on $c$ lead to radically different distributions:
the ODE samples one Gaussian distribution;
the SDE samples a different Gaussian distribution;
and the real distribution of $x$ given $c$ is yet another one.
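To see this numerically, here is a self-contained sketch in one convenient Gaussian instance, chosen here purely for illustration: $p(c) = \mathcal{N}(0, 1)$, $p(x \mid c) = \mathcal{N}(c, 1)$, and an Ornstein–Uhlenbeck noising run up to $T = 5$, so that both exact scores are available in closed form. The discretizations are plain Euler and Euler–Maruyama schemes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: p(c) = N(0,1), p(x|c) = N(c,1), OU noising dx = -x dt + sqrt(2) dW.
# Exactly: p_t(x | c) = N(c e^{-t}, 1) and p_t(x) = N(0, 1 + e^{-2t}).
def exact_guided_score(t, x, c, w):
    score_cond = -(x - c * np.exp(-t))            # grad_x log p_t(x | c)
    score_uncond = -x / (1.0 + np.exp(-2.0 * t))  # grad_x log p_t(x)
    return (1.0 - w) * score_uncond + w * score_cond

def sample(c, w, stochastic, n=100_000, T=5.0, n_steps=2_000):
    """Integrate the w-guided reverse dynamics from t = T down to t = 0.

    stochastic=True : Euler-Maruyama on the reverse SDE (DDPM-like sampling);
    stochastic=False: Euler on the probability-flow ODE (DDIM-like sampling).
    Both start from the exact conditional marginal p_T(. | c)."""
    dt = T / n_steps
    x = c * np.exp(-T) + rng.standard_normal(n)   # x_T ~ p_T(. | c) = N(c e^{-T}, 1)
    for k in range(n_steps):
        t = T - k * dt
        s = exact_guided_score(t, x, c, w)
        if stochastic:    # reverse SDE step: drift x + 2 s, diffusion sqrt(2)
            x = x + (x + 2.0 * s) * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n)
        else:             # probability-flow ODE step (run backward in time)
            x = x + (x + s) * dt
    return x

c, w = 1.0, 3.0
for name, stochastic in [("SDE (DDPM-like)", True), ("ODE (DDIM-like)", False)]:
    samples = sample(c, w, stochastic)
    print(f"{name}: mean {samples.mean():+.3f}, std {samples.std():.3f}")
print(f"target p(x | c) = N({c}, 1): mean {c:+.3f}, std 1.000   (CFG scale w = {w})")
```

Running it with $w \ne 1$ shows three different Gaussians: the two samplers disagree with each other and with the target $\mathcal{N}(c, 1)$ (at $w = 1$ all three coincide).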
At the time of writing this post (March 2025), it really remains unclear (at least to me) what distribution is actually sampled by the CFG technique above. Something is lacking in our understanding of CFG.
The same recipe applies to flow matching models: we simply replace the score with the velocity field.
In this case, we need to learn a « joint velocity field » $v_t(x, c)$: conditionally on $c$, the entire ODE system is $\dot{x}_t = v_t(x_t, c)$. In practice, this means that the conditional flows/velocities used as training targets are actually conditioned twice: once on the conditioning information $c$, and once on the final sample itself (the data endpoint of the path). Note that here again, $c$ can be the placeholder $\varnothing$, meaning that the flow is unconditional. Once the approximation $v_\theta(t, x, c) \approx v_t(x, c)$ is learnt, we can mimic the strategy (10): at sampling, we use the velocity
$$ v^{w}_\theta(t, x, c) = (1 - w)\, v_\theta(t, x, \varnothing) + w\, v_\theta(t, x, c). \tag{12} $$
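As a sketch of this « conditioned twice » remark, here is what a guided conditional flow matching training step could look like, under assumptions made only for illustration: paths parametrized on $[0,1]$ with data at $t = 0$ and noise at $t = 1$, straight-line interpolation, and a placeholder velocity network `v_net(t, x_t, c)`.

```python
import torch

def cfm_training_step(v_net, x0, c, p_uncond=0.1, null_token=0):
    """Guided conditional flow matching step (straight-line paths, data at t=0, noise at t=1).

    The regression target is conditioned twice:
      - on the conditioning information c, fed to the network and randomly dropped to ∅,
      - on the endpoints (x0, eps) of the path, through the target velocity eps - x0.
    All names (v_net, null_token, ...) are placeholders, not the post's notation.
    """
    b = x0.shape[0]
    # label dropout: with probability p_uncond, train the unconditional velocity
    drop = torch.rand(b, device=x0.device) < p_uncond
    c = torch.where(drop, torch.full_like(c, null_token), c)

    t = torch.rand(b, device=x0.device).view(-1, 1)
    eps = torch.randn_like(x0)                 # noise endpoint of the path
    x_t = (1.0 - t) * x0 + t * eps             # linear interpolation between data and noise
    target = eps - x0                          # endpoint-conditioned velocity d x_t / d t
    return ((v_net(t, x_t, c) - target) ** 2).mean()
```

At sampling time, one then integrates the ODE from $t = 1$ back to $t = 0$ using the guided velocity (12) in place of $v_\theta(t, x, c)$.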
Diffusion Models Beat GANs on Image Synthesis, the paper that really pushed classifier guidance forward.
Classifier-Free Diffusion Guidance, the first paper to introduce CFG.
CFG is a predictor-corrector, a nice recent (Oct. 2025) review of CFG.
What does guidance do?, another recent paper (Sep. 2025) on the topic.