For two probability measures $\mathbb{P}, \mathbb{Q}$ supported on $\mathbb{R}^d$ and with densities $p, q$ with respect to the Lebesgue measure, the Kullback-Leibler divergence between them is defined as
$$\mathrm{kl}(\mathbb{P}\Vert \mathbb{Q}) = \mathbb{E}_{X \sim \mathbb{P}}\left[ \ln \left(\frac{p(X)}{q(X)}\right)\right] = \int_{\mathbb{R}^d} p(x)\ln(p(x)) - p(x)\ln(q(x))\,\mathrm{d}x.$$
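This definition lends itself directly to a Monte Carlo sanity check: draw samples from $\mathbb{P}$ and average $\ln p - \ln q$. Here is a minimal sketch with two arbitrary univariate Gaussians standing in for $\mathbb{P}$ and $\mathbb{Q}$ (the parameters, sample size, and use of `scipy.stats.norm` are illustrative choices, not part of the note):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two arbitrary 1-d Gaussians playing the roles of P and Q.
p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=2.0)

# Monte Carlo estimate of E_{X~P}[ln p(X) - ln q(X)].
x = p.rvs(size=100_000, random_state=rng)
kl_mc = np.mean(p.logpdf(x) - q.logpdf(x))

# Known closed form in 1 d: ln(s2/s1) + (s1^2 + (m1-m2)^2) / (2 s2^2) - 1/2.
kl_exact = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0) ** 2) / (2 * 2.0**2) - 0.5
print(kl_mc, kl_exact)  # the two numbers should be close (≈ 0.443)
```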
Reminders on the $\mathrm{kl}$ divergence.
If $f$ is a density function, the « relative entropy of $p$ with respect to $f$ » is the quantity defined as
$$H_p(f) = -\int p(x) \ln f(x)\,\mathrm{d}x.$$
Information theory à la Shannon tells us that this is the mean cost of « encoding » random variables drawn from $p$ using the density $f$. This cost is minimized for $f = p$ and the minimal cost is $H_p(p)$, the entropy of $p$: that's Shannon's theorem. The Kullback-Leibler divergence is thus the difference $H_p(q) - H_p(p)$; in other words, it quantifies what is lost when encoding $p$ with $q$, or how much information on $p$ is not contained in $q$.
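The decomposition $\mathrm{kl}(\mathbb{P}\Vert\mathbb{Q}) = H_p(q) - H_p(p)$ can also be checked numerically; here is a small sketch for 1-d Gaussians where the integrals are computed with `scipy.integrate.quad` (again, the specific densities are arbitrary illustration choices):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=2.0)

# Cross term H_p(q) = -∫ p ln q and entropy H_p(p) = -∫ p ln p.
H_pq = -quad(lambda x: p.pdf(x) * q.logpdf(x), -np.inf, np.inf)[0]
H_pp = -quad(lambda x: p.pdf(x) * p.logpdf(x), -np.inf, np.inf)[0]

# Direct integral of p ln(p/q): should equal H_pq - H_pp.
kl_direct = quad(lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), -np.inf, np.inf)[0]
print(H_pq - H_pp, kl_direct)  # both ≈ 0.443
```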
In dimension $d$, the Gaussian distribution $\mathscr{N}(\mu, \Sigma)$ with mean $\mu$ and covariance $\Sigma$ (a $d\times d$ positive-definite, hence nonsingular, matrix) has density
$$g_{\mu, \Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}}\exp\left\lbrace - \frac{\langle x- \mu, \Sigma^{-1}(x-\mu)\rangle }{2}\right\rbrace$$
where $|\Sigma|$ is the determinant of the matrix $\Sigma$.
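For what it's worth, evaluating this density literally matches what `scipy.stats.multivariate_normal` computes; a quick sketch with an arbitrary mean and covariance (the helper name `gaussian_density` is made up for the example):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d = 3

# Arbitrary mean and positive-definite covariance, for illustration only.
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)

def gaussian_density(x, mu, Sigma):
    """Evaluate g_{mu,Sigma}(x) exactly as written above."""
    d = mu.shape[0]
    diff = x - mu
    quad_form = diff @ np.linalg.solve(Sigma, diff)   # <x-mu, Sigma^{-1}(x-mu)>
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const

x = rng.normal(size=d)
print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```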
The point of this note is the following formula: no one remembers it and I always have to google it myself.
$$\mathrm{kl}(\mathscr{N}(\mu_1, \Sigma_1)\Vert \mathscr{N}(\mu_2, \Sigma_2)) = \frac{1}{2}\ln |\Sigma_2\Sigma_1^{-1}| - \frac{d}{2} + \frac{1}{2}\mathrm{trace}(\Sigma_1 \Sigma_2^{-1}) + \frac{1}{2}\langle \mu_2 - \mu_1, \Sigma_2^{-1}(\mu_2-\mu_1)\rangle.$$
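In code the formula is a few lines of linear algebra. Here is a minimal numpy sketch (the function name `kl_gaussians` and the random test parameters are my own choices), checked against a Monte Carlo estimate of $\mathbb{E}[\ln p(X)/q(X)]$ with $X \sim \mathscr{N}(\mu_1, \Sigma_1)$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """kl(N(mu1, Sigma1) || N(mu2, Sigma2)) via the closed form above."""
    d = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (
        np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))  # ln |Sigma2 Sigma1^{-1}|
        - d
        + np.trace(Sigma1 @ Sigma2_inv)
        + diff @ Sigma2_inv @ diff
    )

rng = np.random.default_rng(2)
d = 4
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Sigma1, Sigma2 = A @ A.T + np.eye(d), B @ B.T + np.eye(d)

# Monte Carlo check: average of ln p(X) - ln q(X) with X ~ N(mu1, Sigma1).
P = multivariate_normal(mean=mu1, cov=Sigma1)
Q = multivariate_normal(mean=mu2, cov=Sigma2)
X = P.rvs(size=200_000, random_state=rng)
print(kl_gaussians(mu1, Sigma1, mu2, Sigma2))
print(np.mean(P.logpdf(X) - Q.logpdf(X)))  # should be close
```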
We'll write $p = g_{\mu_1, \Sigma_1}$ and $q = g_{\mu_2, \Sigma_2}$, so that
$$\mathrm{kl}(\mathscr{N}(\mu_1, \Sigma_1)\Vert \mathscr{N}(\mu_2, \Sigma_2)) = \mathbb{E}[\ln(p(X)/q(X))]$$
where $X \sim \mathscr{N}(\mu_1, \Sigma_1)$. From the definitions, $\ln p(x)/q(x)$ is equal to
$$\frac{\ln |\Sigma_2| - \ln |\Sigma_1|}{2} - \frac{1}{2}\langle x-\mu_1, \Sigma_1^{-1}(x-\mu_1)\rangle + \frac{1}{2}\langle x-\mu_2, \Sigma_2^{-1}(x-\mu_2)\rangle .$$
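This pointwise identity is easy to verify: evaluate both sides at a random point, using `scipy`'s `logpdf` for the left-hand side (parameters below are again arbitrary illustration choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
d = 3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Sigma1, Sigma2 = A @ A.T + np.eye(d), B @ B.T + np.eye(d)

x = rng.normal(size=d)

# Left-hand side: ln p(x) - ln q(x).
lhs = (multivariate_normal(mu1, Sigma1).logpdf(x)
       - multivariate_normal(mu2, Sigma2).logpdf(x))

# Right-hand side: the three terms displayed above.
d1, d2 = x - mu1, x - mu2
rhs = (0.5 * (np.log(np.linalg.det(Sigma2)) - np.log(np.linalg.det(Sigma1)))
       - 0.5 * d1 @ np.linalg.solve(Sigma1, d1)
       + 0.5 * d2 @ np.linalg.solve(Sigma2, d2))
print(lhs, rhs)  # identical up to rounding
```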
We recall that for any vector $x\in\mathbb{R}^d$ and matrix $M$, we can write $\langle x, Mx\rangle = \mathrm{trace}(xx^\top M)$; moreover, we recall that
expectations can be swapped with linear maps, i.e. $\mathbb{E}[\ell(X)] = \ell(\mathbb{E}[X])$ for any linear map $\ell$ (such as the trace),
and that if $X \sim \mathscr{N}(\mu_1, \Sigma_1)$ then $\mathbb{E}[(X-\mu_1)(X-\mu_1)^\top] = \Sigma_1$ (both facts are checked numerically in the sketch below).
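A quick numerical sketch of the two facts just recalled, with arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3

# <x, Mx> = trace(x x^T M) for an arbitrary vector and matrix.
x = rng.normal(size=d)
M = rng.normal(size=(d, d))
print(x @ M @ x, np.trace(np.outer(x, x) @ M))  # equal up to rounding

# Empirical second moment of X - mu1 for X ~ N(mu1, Sigma1) approaches Sigma1.
mu1 = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma1 = A @ A.T + np.eye(d)
X = rng.multivariate_normal(mu1, Sigma1, size=200_000)
print(np.round((X - mu1).T @ (X - mu1) / X.shape[0], 2))
print(np.round(Sigma1, 2))  # close to the empirical matrix above
```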
Consequently,
$$\begin{aligned}\mathbb{E}[\langle X-\mu_1, \Sigma_1^{-1}(X-\mu_1)\rangle] &= \mathbb{E}[\mathrm{trace}((X-\mu_1)(X-\mu_1)^\top \Sigma_1^{-1})] \\
&= \mathrm{trace}(\mathbb{E}[(X-\mu_1)(X-\mu_1)^\top] \Sigma_1^{-1})\\
&= \mathrm{trace}(\Sigma_1 \Sigma_1^{-1}) \\
&= d.\end{aligned}$$
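In other words, the average of this quadratic form is just the dimension; a quick Monte Carlo check (parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 5
mu1 = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma1 = A @ A.T + np.eye(d)

# Sample average of <X - mu1, Sigma1^{-1}(X - mu1)> over many draws.
X = rng.multivariate_normal(mu1, Sigma1, size=200_000)
D = X - mu1
quad_forms = np.einsum("ij,ij->i", D, np.linalg.solve(Sigma1, D.T).T)
print(quad_forms.mean())  # ≈ d = 5
```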
For the remaining quadratic term in the expression of $\ln p(x)/q(x)$ above, since $X-\mu_1$ is centered the cross terms vanish when expanding $(X-\mu_2)(X-\mu_2)^\top = \big((X-\mu_1) + (\mu_1-\mu_2)\big)\big((X-\mu_1) + (\mu_1-\mu_2)\big)^\top$, so we note that $\mathbb{E}[(X-\mu_2)(X-\mu_2)^\top] = \Sigma_1 + (\mu_2 - \mu_1)(\mu_2-\mu_1)^\top$, and hence
$$\begin{aligned}\mathbb{E}[\langle X-\mu_2, \Sigma_2^{-1}(X-\mu_2)\rangle] &= \mathrm{trace}(\Sigma_1\Sigma_2^{-1}) + \langle \mu_2 - \mu_1, \Sigma_2^{-1}(\mu_2-\mu_1)\rangle.\end{aligned}$$
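The same Monte Carlo check works here, comparing the sample average of this quadratic form to $\mathrm{trace}(\Sigma_1\Sigma_2^{-1}) + \langle \mu_2-\mu_1, \Sigma_2^{-1}(\mu_2-\mu_1)\rangle$ (again with arbitrary test parameters):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Sigma1, Sigma2 = A @ A.T + np.eye(d), B @ B.T + np.eye(d)

# Sample average of <X - mu2, Sigma2^{-1}(X - mu2)> with X ~ N(mu1, Sigma1).
X = rng.multivariate_normal(mu1, Sigma1, size=200_000)
D = X - mu2
mc = np.einsum("ij,ij->i", D, np.linalg.solve(Sigma2, D.T).T).mean()

diff = mu2 - mu1
exact = np.trace(Sigma1 @ np.linalg.inv(Sigma2)) + diff @ np.linalg.solve(Sigma2, diff)
print(mc, exact)  # should be close
```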
Gathering everything into the expression of $\mathbb{E}[\ln p(X)/q(X)]$ above, we get exactly the formula announced at the beginning.