The Kullback-Leibler divergence between Gaussians

June 2022

For two probability measures $\mathbb{P}, \mathbb{Q}$ supported on $\mathbb{R}^d$ and with densities $p, q$ with respect to the Lebesgue measure, the Kullback-Leibler divergence between them is defined as

$$ \mathrm{kl}(\mathbb{P}\Vert \mathbb{Q}) = \mathbb{E}_{X \sim \mathbb{P}}\left[ \ln \left(\frac{p(X)}{q(X)}\right)\right] = \int_{\mathbb{R}^d} p(x)\ln(p(x)) - p(x)\ln(q(x))\,\mathrm{d}x. $$
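This expectation form lends itself to a quick Monte Carlo check: sample from $\mathbb{P}$ and average the log-density ratio. Here is a minimal sketch in Python; the two Gaussians, the sample size and the use of numpy/scipy are my own arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two arbitrary Gaussians on R^2, chosen only for illustration.
P = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 1.0]])
Q = multivariate_normal(mean=[1.0, -0.5], cov=[[2.0, 0.0], [0.0, 0.5]])

# kl(P || Q) = E_{X ~ P}[ln p(X) - ln q(X)], estimated by an empirical average.
X = P.rvs(size=200_000, random_state=0)
kl_mc = np.mean(P.logpdf(X) - Q.logpdf(X))
print(f"Monte Carlo estimate of kl(P || Q): {kl_mc:.4f}")
```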

Reminders on the $\mathrm{kl}$ divergence.

If $f$ is a density function, the "relative entropy of $p$ with respect to $f$" is the quantity defined as

$$ H_p(f)=-\int p(x) \ln f(x)\,\mathrm{d}x. $$

Information theory à la Shannon tells us that this is the mean cost of "encoding" random variables drawn from $p$ using the density $f$. This cost is minimized at $f=p$, and the minimal cost is $H_p(p)$, the entropy of $p$: that's Shannon's theorem. The Kullback-Leibler divergence is thus the difference $H_p(q) - H_p(p)$; in other words, it quantifies what is lost when encoding $p$ with $q$, or equivalently how much information about $p$ is not contained in $q$.
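This decomposition is easy to check numerically for Gaussians: estimate $H_p(q)$ by Monte Carlo, take $H_p(p)$ from scipy's closed-form entropy, and compare their difference with a direct estimate of the divergence. A sketch, with arbitrary parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary Gaussians playing the roles of P and Q, for illustration only.
P = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 1.0]])
Q = multivariate_normal(mean=[1.0, -0.5], cov=[[2.0, 0.0], [0.0, 0.5]])

X = P.rvs(size=200_000, random_state=0)

cross_entropy = -np.mean(Q.logpdf(X))   # H_p(q), estimated on samples of P
entropy = P.entropy()                   # H_p(p), available in closed form
kl_direct = np.mean(P.logpdf(X) - Q.logpdf(X))

# The divergence is the difference between the encoding costs with q and with p.
print(cross_entropy - entropy, "vs", kl_direct)
```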

The KL divergence between two Gaussian distributions

In dimension $d$, the Gaussian distribution $\mathscr{N}(\mu, \Sigma)$ with mean $\mu$ and covariance $\Sigma$ (a positive-definite, hence nonsingular, $d\times d$ matrix) has density

$$ g_{\mu, \Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}}\exp\left\lbrace - \frac{\langle x- \mu, \Sigma^{-1}(x-\mu)\rangle }{2}\right\rbrace $$

where $|\Sigma|$ is the determinant of the matrix $\Sigma$. The point of this note is the following formula: no one remembers it and I always have to google it myself.

$$ \mathrm{kl}(\mathscr{N}(\mu_1, \Sigma_1)\Vert \mathscr{N}(\mu_2, \Sigma_2)) = \frac{1}{2}\ln |\Sigma_2\Sigma_1^{-1}| - \frac{d}{2} + \frac{1}{2}\mathrm{trace}(\Sigma_1 \Sigma_2^{-1}) + \frac{1}{2}\langle \mu_2 - \mu_1, \Sigma_2^{-1}(\mu_2-\mu_1)\rangle. $$
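Here is a direct transcription of the formula into Python, cross-checked against the Monte Carlo estimate of the definition from the sketch above; the parameters and the numpy/scipy usage are again my own choices, not part of the formula itself.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """Closed-form kl(N(mu1, Sigma1) || N(mu2, Sigma2)), term by term as above."""
    d = len(mu1)
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    log_det = np.log(np.linalg.det(Sigma2 @ np.linalg.inv(Sigma1)))  # ln|Sigma2 Sigma1^{-1}|
    trace_term = np.trace(Sigma1 @ Sigma2_inv)
    quad = diff @ Sigma2_inv @ diff                                   # Mahalanobis-type term
    return 0.5 * (log_det - d + trace_term + quad)

mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
mu2, Sigma2 = np.array([1.0, -0.5]), np.array([[2.0, 0.0], [0.0, 0.5]])

closed_form = kl_gaussians(mu1, Sigma1, mu2, Sigma2)

# Monte Carlo estimate of E_{X ~ P}[ln p(X) - ln q(X)] for comparison.
P = multivariate_normal(mean=mu1, cov=Sigma1)
Q = multivariate_normal(mean=mu2, cov=Sigma2)
X = P.rvs(size=200_000, random_state=0)
estimate = np.mean(P.logpdf(X) - Q.logpdf(X))

print(f"closed form: {closed_form:.4f}, Monte Carlo: {estimate:.4f}")
```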

The proof, if someone needs it

We write $p = g_{\mu_1, \Sigma_1}$ and $q = g_{\mu_2, \Sigma_2}$, so that

$$ \mathrm{kl}(\mathscr{N}(\mu_1, \Sigma_1)\Vert \mathscr{N}(\mu_2, \Sigma_2)) = \mathbb{E}[\ln(p(X)/q(X))] $$

where $X \sim \mathscr{N}(\mu_1, \Sigma_1)$. From the definitions, $\ln(p(x)/q(x))$ is equal to

$$ \frac{\ln |\Sigma_2| - \ln |\Sigma_1|}{2} - \frac{1}{2}\langle x-\mu_1, \Sigma_1^{-1}(x-\mu_1)\rangle + \frac{1}{2}\langle x-\mu_2, \Sigma_2^{-1}(x-\mu_2)\rangle. $$

We recall that for any vector $x\in\mathbb{R}^d$ and matrix $M$, we can write $\langle x, Mx\rangle = \mathrm{trace}(xx^\top M)$; moreover, since $X \sim \mathscr{N}(\mu_1, \Sigma_1)$, we have $\mathbb{E}[(X-\mu_1)(X-\mu_1)^\top] = \Sigma_1$.

Consequently,

$$\begin{aligned}\mathbb{E}[\langle X-\mu_1, \Sigma_1^{-1}(X-\mu_1)\rangle] &= \mathbb{E}[\mathrm{trace}((X-\mu_1)(X-\mu_1)^\top \Sigma_1^{-1})] \\ &= \mathrm{trace}(\mathbb{E}[(X-\mu_1)(X-\mu_1)^\top] \Sigma_1^{-1})\\ &= \mathrm{trace}(\Sigma_1 \Sigma_1^{-1}) \\&= d.\end{aligned}$$

For the last quadratic term in the expansion above, since $X-\mu_1$ is centered we note that $\mathbb{E}[(X-\mu_2)(X-\mu_2)^\top] = \Sigma_1 + (\mu_2 - \mu_1)(\mu_2-\mu_1)^\top$, so that

$$ \mathbb{E}[\langle X-\mu_2, \Sigma_2^{-1}(X-\mu_2)\rangle] = \mathrm{trace}(\Sigma_1\Sigma_2^{-1}) + \langle \mu_2 - \mu_1, \Sigma_2^{-1}(\mu_2-\mu_1)\rangle. $$

Gathering everything in the expansion of $\ln(p(X)/q(X))$ and taking expectations, we get exactly the formula announced above.
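For completeness, the two intermediate expectations used in the proof can also be checked by simulation; a quick sketch, again with arbitrary parameters and assuming numpy is available:

```python
import numpy as np

rng = np.random.default_rng(0)

mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
mu2, Sigma2 = np.array([1.0, -0.5]), np.array([[2.0, 0.0], [0.0, 0.5]])
d = len(mu1)

X = rng.multivariate_normal(mu1, Sigma1, size=500_000)
Sigma1_inv = np.linalg.inv(Sigma1)
Sigma2_inv = np.linalg.inv(Sigma2)

# E[<X - mu1, Sigma1^{-1}(X - mu1)>] should equal d.
q1 = np.einsum("ni,ij,nj->n", X - mu1, Sigma1_inv, X - mu1)
print(q1.mean(), "vs", d)

# E[<X - mu2, Sigma2^{-1}(X - mu2)>] should equal
# trace(Sigma1 Sigma2^{-1}) + <mu2 - mu1, Sigma2^{-1}(mu2 - mu1)>.
q2 = np.einsum("ni,ij,nj->n", X - mu2, Sigma2_inv, X - mu2)
target = np.trace(Sigma1 @ Sigma2_inv) + (mu2 - mu1) @ Sigma2_inv @ (mu2 - mu1)
print(q2.mean(), "vs", target)
```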