For two probability measures supported on $\mathbb{R}^d$ and having densities $p$ and $q$ with respect to the Lebesgue measure, the Kullback-Leibler divergence between them is defined as
$$\mathrm{KL}(p \,\|\, q) = \int_{\mathbb{R}^d} p(x) \ln \frac{p(x)}{q(x)} \, \mathrm{d}x. \tag{1}$$
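If you want to check this definition numerically, a one-dimensional quadrature is already enough. The sketch below is purely illustrative: the helper name `kl_integral` and the two univariate Gaussians are arbitrary choices, not part of the note.

```python
# Numerical check of the KL definition in dimension 1 (illustrative sketch).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_integral(p_pdf, q_pdf, lo=-20.0, hi=20.0):
    """Approximate KL(p || q) = ∫ p(x) ln(p(x)/q(x)) dx by quadrature."""
    integrand = lambda x: p_pdf(x) * np.log(p_pdf(x) / q_pdf(x))
    value, _ = quad(integrand, lo, hi)
    return value

p = norm(loc=0.0, scale=1.0).pdf   # density of N(0, 1)
q = norm(loc=1.0, scale=2.0).pdf   # density of N(1, 4)
print(kl_integral(p, q))           # ≈ 0.4431 for these two Gaussians
```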
Reminders on the divergence.
If $q$ is a density function, the « relative entropy of $p$ with respect to $q$ » is the nonnegative quantity defined as
$$H(p, q) = - \int_{\mathbb{R}^d} p(x) \ln q(x) \, \mathrm{d}x. \tag{2}$$
Information theory à la Shannon tells us that this is the mean cost of « encoding » random variables drawn from $p$ using the density $q$. This cost is minimized for $q = p$ and the minimal cost is $H(p) = H(p, p)$, the entropy of $p$ – that's Shannon's theorem. The Kullback-Leibler divergence is thus the difference $\mathrm{KL}(p \,\|\, q) = H(p, q) - H(p)$; in other words, it quantifies what is lost when encoding $p$ with $q$, or again what quantity of information on $p$ is not contained in $q$.
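This bookkeeping is perhaps clearest with finitely many symbols. In the toy sketch below (the three-point distributions are arbitrary values I picked for the example), the encoding cost, the entropy and their difference are computed directly, and the difference is indeed the KL divergence.

```python
# Discrete toy illustration: cross-entropy = entropy + KL (everything in nats).
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # distribution the symbols are actually drawn from
q = np.array([0.25, 0.25, 0.5])   # distribution used to build the code

encoding_cost = -np.sum(p * np.log(q))   # mean cost of encoding p with q, i.e. H(p, q)
entropy       = -np.sum(p * np.log(p))   # minimal possible cost, reached for q = p
kl            =  np.sum(p * np.log(p / q))

print(encoding_cost, entropy, kl)
print(np.isclose(encoding_cost - entropy, kl))   # True: KL(p || q) = H(p, q) - H(p)
```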
In dimension $d$, the Gaussian distribution with mean $\mu$ and covariance $\Sigma$ (a positive, nonsingular matrix) is given by
$$g_{\mu, \Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right), \tag{3}$$
where $\det \Sigma$ is the determinant of the matrix $\Sigma$. The point of this note is the following formula (no one remembers it and I always have to google it myself):
$$\mathrm{KL}\left( g_{\mu_1, \Sigma_1} \,\|\, g_{\mu_2, \Sigma_2} \right) = \frac{1}{2} \left( \ln \frac{\det \Sigma_2}{\det \Sigma_1} - d + \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) \right). \tag{4}$$
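In code, (4) is a one-liner away. The sketch below is my own transcription: the name `kl_gaussians` and the use of `slogdet` and `solve` for numerical stability are choices of mine, not part of the formula.

```python
# Closed-form KL between two d-dimensional Gaussians, transcribing formula (4).
import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1) || N(mu2, sigma2) ), covariances assumed nonsingular."""
    d = mu1.shape[0]
    diff = mu2 - mu1
    log_det_ratio = np.linalg.slogdet(sigma2)[1] - np.linalg.slogdet(sigma1)[1]
    trace_term = np.trace(np.linalg.solve(sigma2, sigma1))   # tr(Sigma2^{-1} Sigma1)
    quad_term = diff @ np.linalg.solve(sigma2, diff)         # (mu2-mu1)^T Sigma2^{-1} (mu2-mu1)
    return 0.5 * (log_det_ratio - d + trace_term + quad_term)

# Example: KL( N(0, I_2) || N((1,1), 2 I_2) ) = 0.5 * ln 4 ≈ 0.693
print(kl_gaussians(np.zeros(2), np.eye(2), np.ones(2), 2.0 * np.eye(2)))
```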
We'll write $p = g_{\mu_1, \Sigma_1}$ and $q = g_{\mu_2, \Sigma_2}$, so that
$$\mathrm{KL}(p \,\|\, q) = \mathbb{E}\left[ \ln \frac{p(X)}{q(X)} \right], \tag{5}$$
where $X \sim g_{\mu_1, \Sigma_1}$. From the definitions, this quantity is equal to
$$\frac{1}{2} \ln \frac{\det \Sigma_2}{\det \Sigma_1} - \frac{1}{2} \, \mathbb{E}\left[ (X - \mu_1)^\top \Sigma_1^{-1} (X - \mu_1) \right] + \frac{1}{2} \, \mathbb{E}\left[ (X - \mu_2)^\top \Sigma_2^{-1} (X - \mu_2) \right]. \tag{6}$$
We recall that for any vector $v$ and matrix $M$, we can write $v^\top M v = \mathrm{tr}(M v v^\top)$; moreover, we recall that
- expectations can be swapped with linear maps, i.e. if $L$ is linear then $\mathbb{E}[L(X)] = L(\mathbb{E}[X])$,
- if $X \sim g_{\mu, \Sigma}$ then $\mathbb{E}[(X - \mu)(X - \mu)^\top] = \Sigma$.

These identities are checked numerically in the sketch after this list.
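In the sketch, the dimension, the matrix $M$ and the covariance $\Sigma$ are arbitrary toy values, and the covariance fact is only verified up to Monte Carlo error; the variable names are mine.

```python
# Sanity checks of the trace identity and of the covariance fact (arbitrary toy values).
import numpy as np

rng = np.random.default_rng(0)
d = 3
v = rng.normal(size=d)
M = rng.normal(size=(d, d))

# v^T M v = tr(M v v^T)
print(np.isclose(v @ M @ v, np.trace(M @ np.outer(v, v))))   # True

# If X ~ N(mu, Sigma), then E[(X - mu)(X - mu)^T] = Sigma (Monte Carlo check)
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)               # a positive definite covariance
X = rng.multivariate_normal(mu, Sigma, size=200_000)
empirical_cov = (X - mu).T @ (X - mu) / X.shape[0]
print(np.max(np.abs(empirical_cov - Sigma)))   # small, and shrinks with the sample size
```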
These reminders give $\mathbb{E}\left[ (X - \mu_1)^\top \Sigma_1^{-1} (X - \mu_1) \right] = \mathbb{E}\left[ \mathrm{tr}\left( \Sigma_1^{-1} (X - \mu_1)(X - \mu_1)^\top \right) \right] = \mathrm{tr}\left( \Sigma_1^{-1} \Sigma_1 \right) = d$ for the first expectation in (6). For the second term in (6), writing $X - \mu_2 = (X - \mu_1) + (\mu_1 - \mu_2)$ and since $X - \mu_1$ is centered we note that $\mathbb{E}\left[ (X - \mu_1)^\top \Sigma_2^{-1} (\mu_1 - \mu_2) \right] = 0$, so that
$$\mathbb{E}\left[ (X - \mu_2)^\top \Sigma_2^{-1} (X - \mu_2) \right] = \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) + (\mu_1 - \mu_2)^\top \Sigma_2^{-1} (\mu_1 - \mu_2).$$
Gathering everything into (6), we get exactly (4).
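Finally, a Monte Carlo check of (4): draw samples from $p$, average $\ln p(X) - \ln q(X)$, and compare with the closed formula. The parameters below are arbitrary, and the two numbers only agree up to sampling error.

```python
# Monte Carlo check of formula (4): estimate E[ln p(X) - ln q(X)] under X ~ p.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
mu1, sigma1 = np.zeros(2), np.array([[2.0, 0.3], [0.3, 1.0]])
mu2, sigma2 = np.array([1.0, -1.0]), np.array([[1.0, 0.0], [0.0, 3.0]])

p = multivariate_normal(mu1, sigma1)
q = multivariate_normal(mu2, sigma2)

X = p.rvs(size=500_000, random_state=rng)
mc_estimate = np.mean(p.logpdf(X) - q.logpdf(X))

# Formula (4), written out directly
d = mu1.shape[0]
diff = mu2 - mu1
closed_form = 0.5 * (np.linalg.slogdet(sigma2)[1] - np.linalg.slogdet(sigma1)[1] - d
                     + np.trace(np.linalg.solve(sigma2, sigma1))
                     + diff @ np.linalg.solve(sigma2, diff))

print(mc_estimate, closed_form)   # the two values agree up to Monte Carlo error
```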